COMP3202 - Intro to Machine Learning
Linear Methods for Classification
Linear Models for Classification
For tasks where the response variable is categorical, we need a method that models the posterior probabilities
Pr(Y = k | X = x),
where k is the class of instance x.
Linear Models for Classification
If Pr(Y = k | X = x) is linear in X, then
– the decision boundaries will be linear, and
– we can use a linear model.
Figure from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, 2009.
Why not use linear regression?
Pr(Y = k | X = x) must be modelled with a function that gives outputs between 0 and 1 for all values of X.
Fig. from An Introduction to Statistical Learning: with Applications in R by James, Witten, Hastie, and Tibshirani, 2013.
From linear to logistic regression
e is Euler's number, e ≈ 2.718281
Fig. from An Introduction to Statistical Learning: with Applications in R by James, Witten, Hastie, and Tibshirani, 2013.
Logistic Regression
We need to model the relationship between p(X) = Pr(Y = 1 | X = x) and X.
Consider using a linear model to represent the probabilities:
p(X) = β0 + β1X
Using this equation, the output for very large or very small input values could fall outside the range [0, 1]. (Why is this not sensible?)
To avoid this problem, we use the logistic function, which ensures the output lies within the range (0, 1):
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
This function produces an S-shaped curve that, regardless of the value of X, produces a sensible output.
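A minimal sketch of this contrast (the coefficient values below are invented purely for illustration): the linear predictor can wander far outside [0, 1], while the logistic function always returns a value strictly between 0 and 1.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # this branch avoids overflow for large negative z
    return ez / (1.0 + ez)

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -4.0, 0.002

for x in [-5000, 0, 2000, 10000]:
    linear = beta0 + beta1 * x          # can fall outside [0, 1]
    p = logistic(linear)                # always strictly between 0 and 1
    print(f"x={x:6d}  linear={linear:8.2f}  logistic={p:.6f}")
```

The S-shape comes from the exponential saturating at both ends: very negative inputs map near 0, very positive inputs map near 1.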
Logistic Regression
After a bit of manipulation of p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)), we find that
p(X) / (1 − p(X)) = e^(β0 + β1X)
called the odds, which takes values between 0 and ∞.
By taking the logarithm of both sides:
log( p(X) / (1 − p(X)) ) = β0 + β1X
called the log-odds or logit.
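This identity is easy to check numerically (the coefficients below are made up for illustration): for any x, forming the odds p/(1 − p) and taking the log recovers the linear predictor β0 + β1x exactly.

```python
import math

beta0, beta1 = -3.0, 0.8   # hypothetical coefficients, for illustration only

def p(x):
    """Logistic model: p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

for x in [0.0, 1.0, 2.5]:
    prob = p(x)
    odds = prob / (1.0 - prob)             # lies in (0, inf)
    log_odds = math.log(odds)              # equals beta0 + beta1 * x
    print(f"x={x}  p={prob:.4f}  odds={odds:.4f}  logit={log_odds:.4f}")
```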
Finding the Coefficients
Likelihood function:
ℓ(β0, β1) = ∏(i: yᵢ = 1) p(xᵢ) × ∏(i′: yᵢ′ = 0) (1 − p(xᵢ′))
Maximum likelihood is a very general approach used to estimate the βs that maximise this likelihood function.
Any statistical software can be used to estimate the βs (e.g. with optimisers such as SGD).
Generalized likelihood function:
ℓ(β) = ∏ᵢ p(xᵢ)^yᵢ (1 − p(xᵢ))^(1 − yᵢ)
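Maximum-likelihood fitting can be sketched with plain gradient ascent on the log-likelihood. The data below are synthetic, generated from known coefficients chosen only for illustration, so the estimates can be compared against the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data drawn from a known logistic model (illustration only)
true_b0, true_b1 = -1.0, 2.0
x = rng.normal(size=500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * x))))

# Gradient ascent on the log-likelihood
#   l(b0, b1) = sum_i [ y_i * log p(x_i) + (1 - y_i) * log(1 - p(x_i)) ]
b0, b1 = 0.0, 0.0
lr = 1.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    b0 += lr * np.mean(y - p)        # d l / d b0, averaged over the sample
    b1 += lr * np.mean((y - p) * x)  # d l / d b1, averaged over the sample
print(f"estimates: b0={b0:.3f}, b1={b1:.3f} (true: {true_b0}, {true_b1})")
```

In practice, statistical packages use faster second-order methods (e.g. Newton's method), but the objective being maximised is the same log-likelihood.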
Logistic Regression
Maximum likelihood is used to estimate the βs.
[Figure: fitted logistic curve — probability of Y = 1 (y-axis) against X (x-axis).]
Fig. adapted from An Introduction to Statistical Learning: with Applications in R by James, Witten, Hastie, and Tibshirani, 2013.
Logistic Regression
● Logit is linear in X
● If βi is positive then increasing Xi will increase p(X)
● If βi is negative then increasing Xi will decrease p(X)
● Predict Y = 1 for any instance for which p(X) > threshold
● The decision boundary is the set of points for which the log-odds are zero.
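These rules can be stated directly in code (the coefficients below are hypothetical): with β0 = −4 and β1 = 2, the log-odds are zero at x = −β0/β1 = 2, which is exactly where p(x) = 0.5, so the predicted class flips there at the default threshold.

```python
import math

beta0, beta1 = -4.0, 2.0   # hypothetical coefficients, for illustration only

def p(x):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def predict(x, threshold=0.5):
    """Predict Y = 1 whenever p(x) exceeds the threshold."""
    return 1 if p(x) > threshold else 0

boundary = -beta0 / beta1           # log-odds are zero here, so p(boundary) = 0.5
print(boundary)                     # 2.0
print(predict(1.9), predict(2.1))   # 0 1  -- the class flips at the boundary
```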
Example
Suppose we want to use logistic regression to predict whether an individual will default on their credit card payment on the basis of annual income, monthly credit card balance, and student status.
Example and Figure from An Introduction to Statistical Learning: with Applications in R by James, Witten, Hastie, and Tibshirani, 2013.
Making Predictions
Based on the estimated coefficients (β̂0 = −10.6513 and β̂1 = 0.0055 for balance, as reported in ISLR), the default probability for an individual with a balance of $1,000 is:
p(1000) = e^(−10.6513 + 0.0055 × 1000) / (1 + e^(−10.6513 + 0.0055 × 1000)) ≈ 0.00576
Fig. from An Introduction to Statistical Learning: with Applications in R by James, Witten, Hastie, and Tibshirani, 2013.
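The calculation can be carried out directly with the logistic function, using the balance-only coefficients reported in ISLR:

```python
import math

# Coefficients for the balance-only model as reported in ISLR
beta0, beta1 = -10.6513, 0.0055

def p_default(balance):
    z = beta0 + beta1 * balance
    return math.exp(z) / (1.0 + math.exp(z))

print(f"{p_default(1000):.5f}")  # about 0.00576 -- under 1%
print(f"{p_default(2000):.5f}")  # about 0.586  -- far riskier
```

Note how a doubling of the balance moves the probability from well under 1% to over 50%: the logit is linear in balance, but the probability is not.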
Confounding
Suppose that we construct a model of the probability of default
using only the feature student status and the coefficient
associated with this feature is 0.4049.
However, when we add the features balance and income to the
model the coefficient associated with the feature student status
is negative. Why?
Confounding
● Interpretation:
○ A student is less risky than a non-student with the same credit card balance
● Confounding occurs when features are correlated
● Results of linear models can change significantly depending on the features included.
● It is important to include all relevant features.
Example and Figure from An Introduction to Statistical Learning: with Applications in R by James, Witten, Hastie, and Tibshirani, 2013.
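The sign flip can be reproduced on synthetic data. The simulation below is invented for illustration (these numbers are not the ISLR Default data): students carry higher balances, but given the same balance they are less likely to default. A student-only model then picks up a positive student coefficient, while the model that also includes balance gives a negative one. The fit is a plain gradient-ascent logistic regression on standardised features.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Synthetic data (illustration only): students carry higher balances,
# but *given* the same balance they are less likely to default.
student = rng.binomial(1, 0.3, n)
balance = rng.normal(1000.0 + 500.0 * student, 300.0)
logit = -8.0 + 0.006 * balance - 0.8 * student
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

def fit(X, y, steps=5000, lr=0.5):
    """Logistic regression by gradient ascent on standardised features."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    X = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        b += lr * X.T @ (y - p) / len(y)
    return b

b_marginal = fit(student[:, None].astype(float), y)
b_full = fit(np.column_stack([student, balance]).astype(float), y)

print("student coefficient, student-only model:", b_marginal[1])  # positive
print("student coefficient, full model:        ", b_full[1])      # negative
```

The marginal model blames students because student status is correlated with balance, the true driver of risk; once balance is in the model, the student coefficient reflects the within-balance comparison and turns negative.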