
3. PERCEPTRON
Consider the problem of supervised learning. In this case, the data is given as a set of pairs

D = {(x(1), y(1)), (x(2), y(2)), . . . , (x(n), y(n))}

x(i) = input = independent variable = predictor = attribute
y(i) = output = dependent variable = response = target

Goal: we want to learn the underlying relationship (hypothesis) between x and y, so that next
time we are given an instance of an input x, we can predict its corresponding output y.

• Typically, x ∈ R^d is a vector of d components

  x = (x1, x2, . . . , xd)ᵀ

More precisely, x might be a customer, a patient, an image, a text, a song, etc., and we need to
represent it numerically:

ϕ(x) = feature representation ∈ R^d

• In this section, we will study a binary classification problem where we are trying to separate
two classes of data instances (data points, examples, samples). We assume y ∈ {+1, −1},
denoting the labels for data instances.

[Figure omitted; source: https://www.v7labs.com/blog/semi-supervised-learning-guide]

• To create a learning algorithm, we need a hypothesis class H which is a set of all possible
  functions (hypotheses) x ↦ y. More precisely,

y = h(x; θ)

where θ denotes parameters.


x→ h →y
The goal is to find h that agrees with the given data in D and we hope that h will perform
well in the future.

• Next, we define what makes one hypothesis better than another. We define the loss (cost,
error) function for a single data instance (x(i) , y (i)) that measures the error of predicting the
class label ŷ (i) := h(x(i)) when the actual true class label is y (i) . We denote the loss by

L(ŷ (i), y (i))

where

ŷ(i) ∈ {+1, −1}   predicted label
y(i) ∈ {+1, −1}   actual true label

Objective: have a small loss on the new data


Proxy: have a small loss on the given training data D = {(x(1), y(1)), (x(2), y(2)), . . . , (x(n), y(n))}.
We minimize the loss on the training data

Etrain(h) = (1/n) Σ_{i=1}^{n} L(h(x(i)), y(i))          (training set error)

and we hope that the loss on the test data (additional data that the algorithm has not seen)
{(x(n+1), y(n+1)), (x(n+2), y(n+2)), . . . , (x(n+n′), y(n+n′))} will also be small

Etest(h) = (1/n′) Σ_{i=n+1}^{n+n′} L(h(x(i)), y(i))          (test error)
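For concreteness, here is a minimal Python sketch of how these errors could be computed, assuming the 0-1 loss L(ŷ, y) = 1 if ŷ ≠ y and 0 otherwise (the function name zero_one_error and the toy labels below are illustrative, not part of the notes):

    import numpy as np

    def zero_one_error(y_pred, y_true):
        """Average 0-1 loss: the fraction of instances that are misclassified."""
        y_pred = np.asarray(y_pred)
        y_true = np.asarray(y_true)
        return np.mean(y_pred != y_true)

    # toy example: five training labels and the corresponding predictions
    y_train      = np.array([+1, -1, +1, +1, -1])
    y_train_pred = np.array([+1, -1, -1, +1, -1])
    print(zero_one_error(y_train_pred, y_train))   # 0.2 -> one mistake out of five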

• How do we come up with a learning algorithm?

D → Learning Algorithm(H) → h

– be a clever human
– use optimization

• In this section, we consider linear classifiers, i.e., we assume that H is a class of all possible
linear separators.

[Figure omitted: linearly separable vs. non-separable data; source: https://automaticaddison.com/linear-separability-and-the-xor-problem/]

In particular, we consider a hypothesis of the form

h(x; w, b) = sign(wᵀx + b) = { +1,  if wᵀx + b > 0
                              { −1,  if wᵀx + b < 0

where w ∈ R^d and b ∈ R.
Recall that

wᵀx = [w1 w2 . . . wd] (x1, x2, . . . , xd)ᵀ = w1 x1 + w2 x2 + . . . + wd xd = Σ_{j=1}^{d} wj xj

The entries of the vector w are called weights, and b is called the bias.

• The algorithm computes the weighted sum of the inputs and if this weighted sum exceeds
some threshold (specifically, −b), then the predicted label is positive, and if the weighted sum
is less than this threshold, the predicted label is negative.
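As a small illustration of this decision rule, here is a sketch in Python (the helper name predict and the numbers are made up for the example; any w, b, and x of matching dimension would work):

    import numpy as np

    def predict(x, w, b):
        """Predict +1 if the weighted sum w·x exceeds the threshold -b, otherwise -1."""
        return 1 if np.dot(w, x) + b > 0 else -1

    w = np.array([2.0, -1.0])   # example weights
    b = -1.0                    # example bias, so the threshold is -b = 1
    print(predict(np.array([1.5, 0.5]), w, b))   # 2*1.5 - 0.5 - 1 = 1.5 > 0  -> +1
    print(predict(np.array([0.0, 1.0]), w, b))   # -1 - 1 = -2 < 0            -> -1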

• For simplicity, from now on, we consider the case d = 2 with two inputs (x1, x2) ∈ R^2. We
  can also visualize the above hypothesis as a single neuron.

  Given an input (x1, x2), the predicted output ŷ is found using two steps:

1. the quantity z (pre-activation) is calculated by


z = w1 x1 + w2 x2 + b = Σ_{j=1}^{2} wj xj + b

2. the predicted output ŷ is calculated by applying the activation function φ as

ŷ = φ(z).

In this case, the activation function is the "sign" function

φ(z) = { +1,  if z > 0
       {  0,  if z = 0
       { −1,  if z < 0

[Figure omitted: the sign function; source: https://en.wikipedia.org/wiki/Sign_function]
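The two steps can be written directly in Python as a short sketch (the names phi and neuron_predict and the numbers are illustrative; np.sign implements exactly the three-valued sign function above):

    import numpy as np

    def phi(z):
        """Sign activation: +1 for positive z, -1 for negative z, 0 at z = 0."""
        return np.sign(z)

    def neuron_predict(x1, x2, w1, w2, b):
        z = w1 * x1 + w2 * x2 + b   # step 1: pre-activation
        return phi(z)               # step 2: activation

    print(neuron_predict(1.0, 2.0, w1=0.5, w2=-0.25, b=0.1))   # z = 0.1 -> 1.0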

• To simplify the notation we define

x̄ = (x1 , x2 , 1)
w̄ = (w1 , w2 , b)

Notice that then

z = wᵀx + b
  = w1 x1 + w2 x2 + b
  = w̄ᵀx̄
  = Σ_{i=1}^{3} w̄i x̄i
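A quick numerical check of this augmentation trick (the numbers are arbitrary):

    import numpy as np

    w, b = np.array([0.5, -0.25]), 0.1
    x = np.array([1.0, 2.0])

    x_bar = np.append(x, 1.0)   # x̄ = (x1, x2, 1)
    w_bar = np.append(w, b)     # w̄ = (w1, w2, b)

    # both expressions give the same pre-activation z
    print(np.dot(w, x) + b)       # 0.1
    print(np.dot(w_bar, x_bar))   # 0.1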

• A single artificial neuron is a model of a biological neuron and is a building block for artificial
neural networks.

[Figure omitted: artificial vs. biological neuron; source: https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7]

• Note that the separator (decision boundary) is of the form

w1 x1 + w2 x2 + b = 0
x2 = −(w1/w2) x1 − b/w2

which is the equation of a line in the (x1, x2)-plane (assuming w2 ≠ 0).
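The boundary can be computed and checked with a short sketch (the parameter values are arbitrary and assume w2 ≠ 0):

    import numpy as np

    w1, w2, b = 2.0, -1.0, 0.5

    x1 = np.linspace(-3, 3, 7)
    x2 = -(w1 / w2) * x1 - b / w2    # points on the line w1*x1 + w2*x2 + b = 0

    # every (x1, x2) pair on this line gives a pre-activation of exactly zero
    print(np.allclose(w1 * x1 + w2 * x2 + b, 0.0))   # True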

• Perceptron

Given the data set

D = {(x(1), y(1)), (x(2), y(2)), . . . , (x(n), y(n))},

to find a linear classifier we need to find the model parameters, i.e., the weights w1, w2 and
the bias b. The perceptron is one of the methods for finding these parameters.

perceptron(D, T, η)
    w̄ = 0̄                      (initialize the parameter w̄ = (w1, w2, b) to be the zero vector)
    for t = 1 to T
        for i = 1 to n
            ŷ(i) = φ(w̄ᵀx̄(i))
            w̄ := w̄ − η(ŷ(i) − y(i)) x̄(i)
    return w̄

(A NumPy sketch of this loop is given after the discussion of the update rule below.)

– Here, T is the number of iterations and η is the learning rate of the ML model (typically,
a number between 0 and 1). Both T and η are hyperparameters.
– Let us try to understand the update rule for w̄ when we are at a data instance (x(i) , y (i))

w̄new = w̄old − η(ŷ (i) − y (i))x̄(i)

Recall that y (i) is the actual true class label, while ŷ (i) := h(x(i)) is the predicted class
label computed by our model.

∗ If our algorithm did not make a mistake, then y (i) = ŷ (i). In that case, w̄new = w̄old.
∗ If our algorithm misclassified x(i), there are two possibilities:

Case 1: y(i) = 1 and ŷ(i) = −1

This means that x(i) is a positive example and it is classified as negative. In other
words, the algorithm claims

w̄ᵀx̄(i) < 0,

while it should be w̄ᵀx̄(i) > 0. In this case the rule modifies w̄ as

w̄new = w̄old + 2η x̄(i).

Note that since our algorithm claims w̄ᵀx̄(i) = ‖w̄‖‖x̄(i)‖ cos θ < 0, the angle θ
between w̄ and x̄(i) is more than 90°. The update adds a multiple of x̄(i) to w̄, which
increases w̄ᵀx̄(i) by 2η‖x̄(i)‖², moving the angle between the updated w̄ and x̄(i)
toward less than 90°, as it should be.

Case 2: y(i) = −1 and ŷ(i) = 1

This means that x(i) is a negative example and it is classified as positive. In other
words, the algorithm claims

w̄ᵀx̄(i) > 0,

while it should be w̄ᵀx̄(i) < 0. In this case the rule modifies w̄ as

w̄new = w̄old − 2η x̄(i).

Note that since our algorithm claims w̄ᵀx̄(i) = ‖w̄‖‖x̄(i)‖ cos θ > 0, the angle θ
between w̄ and x̄(i) is less than 90°. The update subtracts a multiple of x̄(i) from w̄,
which decreases w̄ᵀx̄(i) by 2η‖x̄(i)‖², moving the angle between the updated w̄ and
x̄(i) toward more than 90°, as it should be.

– If D is linearly separable, the perceptron is guaranteed to converge and to produce a
  separating classifier (see the proof in [2], pages 18-20). If D is not linearly separable,
  the perceptron will not converge.
– Note that the update rule states

  (w1, w2, b)ᵀ := (w1, w2, b)ᵀ − η(ŷ(i) − y(i)) (x1(i), x2(i), 1)ᵀ

  or, in expanded form,

  w1 := w1 − η(ŷ(i) − y(i)) x1(i)
  w2 := w2 − η(ŷ(i) − y(i)) x2(i)
  b  := b − η(ŷ(i) − y(i))
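The following is a minimal NumPy sketch of the perceptron pseudocode above. It is not the code from Lecture 3 Perceptron.ipynb or the My Perceptron class; the function name, the toy data, and the hyperparameter values T = 50 and η = 0.1 are illustrative choices only.

    import numpy as np

    def perceptron(X, y, T=50, eta=0.1):
        """Perceptron sketch. X has shape (n, d), y holds labels in {+1, -1}.
        Returns the augmented parameter vector w_bar = (w1, ..., wd, b)."""
        n, d = X.shape
        X_bar = np.hstack([X, np.ones((n, 1))])   # append 1 to each input: x̄ = (x, 1)
        w_bar = np.zeros(d + 1)                   # w̄ = (w, b), initialized to the zero vector
        for t in range(T):
            for i in range(n):
                y_hat = np.sign(np.dot(w_bar, X_bar[i]))          # ŷ(i) = φ(w̄ᵀx̄(i))
                w_bar = w_bar - eta * (y_hat - y[i]) * X_bar[i]   # update rule
        return w_bar

    # toy linearly separable data
    X = np.array([[2.0, 1.0], [1.5, 2.0], [0.0, 0.2], [-1.0, 0.5]])
    y = np.array([+1, +1, -1, -1])
    w_bar = perceptron(X, y)
    print(np.sign(X @ w_bar[:-1] + w_bar[-1]))   # should match y on this toy set

Note that once every instance is classified correctly, ŷ(i) = y(i) and the update term is zero, so w̄ stops changing even though the loop keeps running for all T passes.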

Python code: Lecture 3 Perceptron.ipynb

Homework 1:

• We mentioned that the perceptron converges if the data is linearly separable. Try the sklearn
  Perceptron model for versicolor and virginica, with sepal length and petal length. What do
  you observe?

• We created My Perceptron class for only 2 inputs. Extend this code for 3 inputs. Investigate
the iris data set and choose 3 features to classify setosa and versicolor using your code. Notice
that you cannot easily plot the decision boundary now since the data is 3-dimensional, but you
can still compare the actual and the predicted labels to see how your algorithm is performing.

• Try to generalize the My Perceptron code so it could be used for any number of inputs. (Hint:
  recall that for a list w we can use w[-1] and w[:-1] to access the last value in the list and
  all the values except the very last one. Also, use np.dot, the NumPy dot product, to compute
  the pre-activation value z.)
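To illustrate the hint with a short sketch (the numbers are arbitrary):

    import numpy as np

    w = [0.5, -0.25, 0.1]              # weights followed by the bias, stored in one list
    x = [1.0, 2.0]

    print(w[-1])                       # 0.1 -> the last value (the bias)
    print(w[:-1])                      # [0.5, -0.25] -> all values except the last (the weights)
    print(np.dot(w[:-1], x) + w[-1])   # pre-activation z = w·x + b = 0.1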

References and Reading Material:

[1] Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, Géron (pages 279-288)

[2] MIT notes:


2-LinearClassifiers.pdf
3-Perceptron.pdf (the proof of Perceptron convergence on pages 18-20 is optional)

https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/course/
