Week 3 Lecture Notes
ML: Logistic Regression
Now we are switching from regression problems to classification problems. Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.
$y \in \{0, 1\}$
Where 0 is usually taken as the "negative class" and 1 as the "positive class", but you are free to assign any representation to it.
We're only doing two classes for now, called a "Binary Classification Problem."
One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. This method doesn't work well because classification is not actually a linear function.
Hypothesis Representation
$$0 \le h_\theta(x) \le 1$$
Our new form uses the "Sigmoid Function," also called the "Logistic Function":
$$h_\theta(x) = g(\theta^T x)$$
$$z = \theta^T x$$
$$g(z) = \frac{1}{1 + e^{-z}}$$
The function $g(z)$, shown here, maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Try playing with an interactive plot of the sigmoid function: (https://www.desmos.com/calculator/bgontvxotm).
We start with our old hypothesis (linear regression), except that we want to restrict the range to between 0 and 1. This is accomplished by plugging $\theta^T x$ into the Logistic Function.
$h_\theta(x)$ will give us the probability that our output is 1. For example, $h_\theta(x) = 0.7$ gives us a probability of 70% that our output is 1.
The probability that our prediction is 0 is just the complement of the probability that it is 1 (e.g. if the probability that it is 1 is 70%, then the probability that it is 0 is 30%):
$$h_\theta(x) = P(y = 1 \mid x; \theta) = 1 - P(y = 0 \mid x; \theta)$$
$$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$$
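To make this concrete, here is a minimal Octave sketch of the hypothesis. This is not part of the original notes; the function name sigmoid and the variable names are our own choices:

```
% Sigmoid / logistic function: maps any real input into the (0, 1) interval.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % element-wise, so z may be a scalar, vector, or matrix
end

% For one example x (a column vector with x(1) = 1):
%   h = sigmoid(theta' * x);   gives P(y = 1 | x; theta)
%   1 - h                      gives P(y = 0 | x; theta)
```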
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
$$h_\theta(x) \ge 0.5 \rightarrow y = 1$$
$$h_\theta(x) < 0.5 \rightarrow y = 0$$
The way our logistic function $g$ behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$
Remember:
$$z = 0, \; e^{0} = 1 \Rightarrow g(z) = 1/2$$
$$z \to \infty, \; e^{-\infty} \to 0 \Rightarrow g(z) = 1$$
$$z \to -\infty, \; e^{\infty} \to \infty \Rightarrow g(z) = 0$$
$$h_\theta(x) = g(\theta^T x) \ge 0.5 \quad \text{when} \quad \theta^T x \ge 0$$
So:
$$\theta^T x \ge 0 \Rightarrow y = 1$$
$$\theta^T x < 0 \Rightarrow y = 0$$
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
Example:
$$\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}$$
$$y = 1 \;\text{ if }\; 5 + (-1)x_1 + 0 x_2 \ge 0$$
$$5 - x_1 \ge 0$$
$$-x_1 \ge -5$$
$$x_1 \le 5$$
In this case, our decision boundary is a straight vertical line placed on the graph where $x_1 = 5$, and everything to the left of that denotes $y = 1$, while everything to the right denotes $y = 0$.
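As a rough Octave sketch (our own variable names, assuming a design matrix X whose first column is all ones and the sigmoid helper defined earlier), the discrete prediction is just a threshold at 0.5:

```
p = sigmoid(X * theta) >= 0.5;   % m x 1 logical vector: 1 where h >= 0.5, else 0

% Equivalently, since g(z) >= 0.5 exactly when z >= 0, we can skip the sigmoid:
p = (X * theta) >= 0;
```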
Again, the input to the sigmoid function $g(z)$ (e.g. $\theta^T x$) doesn't need to be linear, and could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any shape to fit our data.
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing
many local optima. In other words, it will not be a convex function.
Instead, our cost function for logistic regression looks like:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$
$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1$$
$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0$$
The more our hypothesis is off from $y$, the larger the cost function output. If our hypothesis is equal to $y$, then our cost is 0:
If our correct answer $y$ is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer $y$ is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
We can compress our cost function's two conditional cases into one case:
$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$
Notice that when $y$ is equal to 1, then the second term $(1 - y)\log(1 - h_\theta(x))$ will be zero and will not affect the result. If $y$ is equal to 0, then the first term $-y \log(h_\theta(x))$ will be zero and will not affect the result.
We can fully write out our entire cost function as follows:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
A vectorized implementation is:
$$h = g(X\theta)$$
$$J(\theta) = \frac{1}{m} \left( -y^{T} \log(h) - (1 - y)^{T} \log(1 - h) \right)$$
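A possible Octave implementation of this vectorized cost, as a sketch under the assumption that X is m×(n+1), y is an m×1 vector of 0/1 labels, and sigmoid() is defined as above:

```
function J = costFunction(theta, X, y)
  m = length(y);              % number of training examples
  h = sigmoid(X * theta);     % m x 1 vector of predictions
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end
```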
Gradient Descent
$$\text{Repeat} \; \{$$
$$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
$$\}$$
We can work out the derivative part to get:
$$\text{Repeat} \; \{$$
$$\quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
$$\}$$
Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.
A vectorized implementation is:
$$\theta := \theta - \frac{\alpha}{m} X^{T} \left( g(X\theta) - \vec{y} \right)$$
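One way this loop might look in Octave; this is a sketch, with alpha and num_iters left as tuning choices for the caller. Computing grad before touching theta is what keeps the update simultaneous:

```
m = length(y);                                        % number of training examples
for iter = 1:num_iters
  grad  = (1 / m) * X' * (sigmoid(X * theta) - y);    % (n+1) x 1 gradient vector
  theta = theta - alpha * grad;                       % updates every theta_j at once
end
```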
Partial derivative of J(θ)
First calculate the derivative of the sigmoid function (it will be useful while finding the partial derivative of $J(\theta)$):
$$\sigma(x)' = \left(\frac{1}{1 + e^{-x}}\right)' = \frac{-(1 + e^{-x})'}{(1 + e^{-x})^2} = \frac{-1' - (e^{-x})'}{(1 + e^{-x})^2} = \frac{0 - (-x)'(e^{-x})}{(1 + e^{-x})^2} = \frac{-(-1)(e^{-x})}{(1 + e^{-x})^2} = \frac{e^{-x}}{(1 + e^{-x})^2}$$
$$= \left(\frac{1}{1 + e^{-x}}\right)\left(\frac{e^{-x}}{1 + e^{-x}}\right) = \sigma(x)\left(\frac{+1 - 1 + e^{-x}}{1 + e^{-x}}\right) = \sigma(x)\left(\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}}\right) = \sigma(x)(1 - \sigma(x))$$
Now we are ready to find out the resulting partial derivative:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{-1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \frac{\partial}{\partial \theta_j} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \frac{\partial}{\partial \theta_j} \log(1 - h_\theta(x^{(i)})) \right]$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1 - y^{(i)}) \frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})} \right]$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1 - y^{(i)}) \frac{\partial}{\partial \theta_j} (1 - \sigma(\theta^T x^{(i)}))}{1 - h_\theta(x^{(i)})} \right]$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} \, h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1 - y^{(i)}) \, h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right]$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} (1 - h_\theta(x^{(i)})) x_j^{(i)} - (1 - y^{(i)}) h_\theta(x^{(i)}) x_j^{(i)} \right]$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} (1 - h_\theta(x^{(i)})) - (1 - y^{(i)}) h_\theta(x^{(i)}) \right] x_j^{(i)}$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x_j^{(i)}$$
$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x_j^{(i)}$$
$$= \frac{1}{m} \sum_{i=1}^{m} \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)}$$
The vectorized version:
$$\nabla J(\theta) = \frac{1}{m} \cdot X^T \cdot \left( g(X\theta) - \vec{y} \right)$$
Advanced Optimization
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. A.
Ng suggests not to write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries
instead, as they're already tested and highly optimized. Octave provides them.
We first need to provide a function that evaluates the following two functions for a given input value θ:
$$J(\theta)$$
$$\frac{\partial}{\partial \theta_j} J(\theta)$$
Then we can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()". (Note: the value for MaxIter should be an integer, not a character string - errata in the video at 7:30)
We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
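Concretely, using the toy cost $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$ in the spirit of the lecture's example (its minimum is clearly at θ = [5; 5]), the call might look like this; costFunction would normally live in its own costFunction.m file:

```
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;   % J(theta)
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);             % dJ / dtheta_1
  gradient(2) = 2 * (theta(2) - 5);             % dJ / dtheta_2
end

options = optimset('GradObj', 'on', 'MaxIter', 100);   % MaxIter: integer, not a string
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```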
Multiclass Classification: One-vs-all
Now we will approach the classification of data into more than two categories. Instead of $y \in \{0, 1\}$ we will expand our definition so that $y \in \{0, 1, \ldots, n\}$. In this case we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that $y$ is a member of one of our classes.
$$y \in \{0, 1, \ldots, n\}$$
$$h_\theta^{(0)}(x) = P(y = 0 \mid x; \theta)$$
$$h_\theta^{(1)}(x) = P(y = 1 \mid x; \theta)$$
$$\cdots$$
$$h_\theta^{(n)}(x) = P(y = n \mid x; \theta)$$
$$\text{prediction} = \max_i \left( h_\theta^{(i)}(x) \right)$$
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic
regression to each case, and then use the hypothesis that returned the highest value as our prediction.
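As an Octave sketch (our own names, not from the notes: all_theta is a matrix whose i-th row holds the trained parameters of the i-th binary classifier), the prediction step might look like:

```
probs  = sigmoid(X * all_theta');   % m x num_classes matrix; column i is h_i(x)
[~, p] = max(probs, [], 2);         % per example, pick the most confident class

% p contains row indices 1..num_classes; subtract 1 if your labels run 0..n.
```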
ML: Regularization
The Problem of Overfitting
High bias or underfitting is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. e.g. if we take $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, then we are making an initial assumption that a linear model will fit the training data well and will be able to generalize, but that may not be the case.
At the other extreme, overfitting or high variance is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:
1) Reduce the number of features: manually select which features to keep, or use a model selection algorithm (studied later in the course).
2) Regularization: keep all the features, but reduce the magnitude of the parameters $\theta_j$.
Cost Function
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
Say we wanted to make the following function more quadratic:
$$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$
We'll want to eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
$$\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2$$
We've added two extra terms at the end to inflate the cost of $\theta_3$ and $\theta_4$. Now, in order for the cost function to get close to zero, we will have to reduce the values of $\theta_3$ and $\theta_4$ to near zero. This will in turn greatly reduce the values of $\theta_3 x^3$ and $\theta_4 x^4$ in our hypothesis function.
We could also regularize all of our theta parameters in a single summation:
$$\min_\theta \; \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated. You can visualize the effect of regularization in this interactive plot: https://www.desmos.com/calculator/1hexc8ntqp
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
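A sketch of this regularized cost in Octave, assuming the same X, y, theta shapes as before; theta(1) plays the role of $\theta_0$ and is left out of the penalty:

```
function J = costFunctionReg(theta, X, y, lambda)
  m   = length(y);
  err = X * theta - y;                         % m x 1 residuals
  J   = (1 / (2 * m)) * (err' * err ...
        + lambda * sum(theta(2:end) .^ 2));    % penalty skips theta(1)
end
```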
Regularized Linear Regression
Gradient Descent
We will modify our gradient descent function to separate out $\theta_0$ from the rest of the parameters because we do not want to penalize $\theta_0$.
$$\text{Repeat} \; \{$$
$$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
$$\quad \theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \qquad j \in \{1, 2, \ldots, n\}$$
$$\}$$
With some manipulation our update rule can also be represented as:
$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
The first term in the above equation, $1 - \alpha \frac{\lambda}{m}$, will always be less than 1. Intuitively you can see it as reducing the value of $\theta_j$ by some amount on every update.
Notice that the second term is now exactly the same as it was before.
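In Octave, one iteration of this regularized update for linear regression (where $h_\theta(x) = \theta^T x$) might be sketched as follows; the variable names are ours, and the gradient is computed once so the update stays simultaneous:

```
grad = (1 / m) * X' * (X * theta - y);                  % unregularized gradient, (n+1) x 1
theta(1)     = theta(1) - alpha * grad(1);              % theta_0: no shrinkage term
theta(2:end) = theta(2:end) * (1 - alpha * lambda / m) ...
               - alpha * grad(2:end);                   % "shrink, then step" form
```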
Normal Equation
Now let's approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
$$\theta = \left( X^T X + \lambda \cdot L \right)^{-1} X^T y$$
$$\text{where } L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}$$
L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including $x_0$), multiplied with a single real number λ.
Recall that if m ≤ n, then $X^T X$ is non-invertible. However, when we add the term λ⋅L, then $X^T X + \lambda \cdot L$ becomes invertible.
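In Octave this might be sketched as follows, where n is the number of features (so X is m×(n+1) including the ones column):

```
L = eye(n + 1);
L(1, 1) = 0;                                % exclude the bias term from regularization
theta = (X' * X + lambda * L) \ (X' * y);   % backslash solve is preferred over inv()
```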
Regularized Logistic Regression
Cost Function
We can regularize logistic regression in a similar way that we regularize linear regression:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
Note Well: The second sum, $\sum_{j=1}^{n} \theta_j^2$, means to explicitly exclude the bias term $\theta_0$. I.e. the θ vector is indexed from 0 to n (holding n+1 values, $\theta_0$ through $\theta_n$), and this sum explicitly skips $\theta_0$ by running from 1 to n, skipping 0.
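A possible Octave sketch combining this cost with its gradient, in the [J, grad] form that fminunc expects; the function name and structure are our own:

```
function [J, grad] = lrCostFunctionReg(theta, X, y, lambda)
  m   = length(y);
  h   = sigmoid(X * theta);
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);   % sum runs j = 1..n, skipping theta_0
  J   = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + reg;
  grad = (1 / m) * X' * (h - y);
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % no penalty on grad(1)
end
```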
Gradient Descent
Just like with linear regression, we will want to separately update $\theta_0$ and the rest of the parameters because we do not want to regularize $\theta_0$.
$$\text{Repeat} \; \{$$
$$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
$$\quad \theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \qquad j \in \{1, 2, \ldots, n\}$$
$$\}$$
This is identical to the gradient descent function presented for linear regression.
Constant Feature
As it turns out, it is crucial to add a constant feature to your pool of features before starting any training of your machine. Normally that feature is just a set of ones for all your training examples.
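In Octave this is the usual one-liner (a sketch; X starts as the m×n matrix of raw features):

```
[m, n] = size(X);
X = [ones(m, 1), X];   % X is now m x (n+1); column 1 is the constant (bias) feature
```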
Below are some insights to explain the reason for this constant feature. The first part draws some analogies from electrical engineering concepts; the second looks at understanding the ones vector by using a simple machine learning example.
Electrical Engineering
From electrical engineering, in particular signal processing, this can be explained as DC and AC.
The initial feature vector X without the constant term captures the dynamics of your model. That means those features particularly record changes in your output y - in other words, changing some feature $X_i$ where i ≠ 0 will produce a change in the output y. AC is normally made out of many components or harmonics; hence we also have many features (yet we have one DC term).
The constant feature represents the DC component. In control engineering this can also be the steady state.
Interestingly, removing the DC term is easily done by differentiating your signal - or simply taking a difference between consecutive points of a discrete signal (it should be noted that at this point the analogy is implying time-based signals - so this will also make sense for machine learning applications with a time basis - e.g. forecasting stock exchange trends).
Another interesting note: if you were to play an AC+DC signal as well as an AC-only signal where both AC components are the same, then they would sound exactly the same. That is because we only hear changes in signals and Δ(AC+DC)=Δ(AC).
Housing price example
Let's assume a simple model which has features that are directly proportional to the expected price, i.e. if feature $X_i$ increases then the expected price y will also increase. As an example, we could have two features: the size of the house in m², and the number of rooms.
When you train your machine you will start by prepending a ones vector $X_0$. You may then find after training that the weight for your initial feature of ones is some value θ0. As it turns out, when applying your hypothesis function $h_\theta(X)$ - in the case of the initial feature you will just be multiplying by a constant (most probably θ0 if you're not applying any other functions such as sigmoids). This constant (let's say it's θ0 for argument's sake) is the DC term. It is a constant that doesn't change.
But what does it mean for this example? Well, let's suppose that someone knows that you have a working model for housing prices. It turns out that for this example, if they ask you how much money they can expect if they sell the house, you can say that they need at least θ0 dollars (or rands) before you even use your learning machine. As with the above analogy, your constant θ0 is somewhat of a steady state where all your inputs are zeros. Concretely, this is the price of a house with no rooms which takes up no space.
However, this explanation has some holes: if you have some features which decrease the price, e.g. age, then the DC term may not be an absolute minimum of the price. This is because the age may make the price go even lower.
Theoretically, if you were to train a machine without a ones vector, $f_{AC}(X)$, its output may not match the output of a machine which had a ones vector, $f_{DC}(X)$. However, $f_{AC}(X)$ may have exactly the same trend as $f_{DC}(X)$, i.e. if you were to plot both machines' outputs you would find that they may look exactly the same except that one output seems to have been shifted (by a constant). With reference to the housing price problem: suppose you make predictions on two houses, houseA and houseB, using both machines. It turns out that while the outputs from the two machines would differ, the difference between houseA's and houseB's predictions according to both machines could be exactly the same.
Realistically, that means a machine trained without the ones vector, $f_{AC}$, could actually be very useful if you have just one benchmark point. This is because you can find out the missing constant by simply taking the difference between the machine's prediction and an actual price - then when making predictions you simply add that constant to whatever output you get. That is: if $house_{benchmark}$ is your benchmark, then the DC component is simply $price(house_{benchmark}) - f_{AC}(features(house_{benchmark}))$.
A simpler and cruder way of putting it is that the DC component of your model represents the inherent bias of the model. The other features then cause tension in order to move away from that bias position.
Kholofelo Moyaba
A simpler approach
A "bias" feature is simply a way to move the "best t" learned vector to better t the data. For example, consider a learning problem with a
single feature X1 . The formula without the X0 feature is just theta1 ∗ X1 = y. This is graphed as a line that always passes through the origin,
with slope y/theta. The x0 term allows the line to pass through a di erent point on the y axis. This will almost always give a better t. Not all best
t lines go through the origin (0,0) right?
Joe Cotton