LOGISTIC REGRESSION CLASSIFIER
How It Works: A Step-by-Step Conceptual Guide
Caglar Subasi, 2019/02
Logistic Regression is a ‘Statistical Learning’ technique categorized among ‘Supervised’ Machine
Learning (ML) methods dedicated to ‘Classification’ tasks. It has gained a tremendous
reputation over the last two decades, especially in the financial sector, due to its prominent ability
to detect defaulters. A contradiction seems to appear when we declare that a classifier whose name
contains the term ‘Regression’ is being used for classification, but this is exactly what makes Logistic
Regression magical: it uses a linear regression equation to produce discrete binary outputs. And yes, it is
also categorized in the ‘Discriminative Models’ subgroup[1] of ML methods, like Support Vector
Machines and the Perceptron, which all use linear equations as a building block and attempt to
maximize the quality of the output on a training set.
[1] The complementary subgroup, called ‘Generative Models’, has members like ‘Naïve Bayes’ and ‘Fisher’s
Linear Discriminants’.
Figure-1: From Decision Function to Decision Boundary
We will follow the guide below throughout the article in the given order. As can be understood
from the contents, this article is just a conceptual manual intending to clarify the technical workflow of
the Logistic Regression Classifier. After a long period of searching and reading, I realized that there is
an abundance of empirical studies but a striking scarcity of material on the theoretical aspects of Machine
Learning implementations. This is maybe why Yuval Noah Harari states in his famous ‘best
seller’ book ‘Homo Deus: A Brief History of Tomorrow’ that:
“...In fact modernity is a surprisingly simple deal. The entire
contract can be summarised in a single phrase: humans agree to
give up ‘meaning’ in exchange for ‘power’...”
With this article, my aim is to create a complete guide which conveys the inner meaning of each
step in the Logistic Regression workflow. Hope you enjoy...
Table of Contents
A. Data Structure
B. Experiment Design
C. Decision/Activation Function
D. Objective Function
E. Optimizing Objectives
E.1. Getting the Gradient Equation (Differentiation)
E.1.1. Coin Experiment, ‘Average Learning’
E.1.2. Credit Scoring Experiment, ‘Stochastic Learning’
E.2. Finding Maxima-Minima
F. Further Readings
G. References
A. Data Structure
Inputs are continuous feature-vectors (x_i’s) of length d, where x_i ∈ R^d and i = 1, ..., N.
So, the input matrix is X ∈ R^(N×d), which contains N inputs (data points), each containing d
features. Inputs can be illustrated as a matrix, as sketched below.
And the output is a discrete, binary variable, such that y_i ∈ {0, 1}.
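As a concrete illustration, here is a minimal NumPy sketch of this data structure; the sizes N and d and the label values are arbitrary placeholders, not values taken from the text.

```python
import numpy as np

# Hypothetical sizes: N data points, each with d continuous features.
N, d = 5, 3

# Input matrix X has shape (N, d): one row per observation, one column per feature.
X = np.random.default_rng(0).normal(size=(N, d))

# Output y is discrete and binary: each label is either 0 or 1.
y = np.array([1, 0, 1, 1, 0])

print(X.shape)            # (5, 3)
print(set(y.tolist()))    # {0, 1}
```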
B. Experiment Design
Let’s say we have a ‘flipping/tossing a coin’ experiment. Supposing the coin is a fair one brings us
‘equally likely’ outcomes of ‘Head’ and ‘Tail’. That is, the ‘posterior’ probabilities are

P(y = Head | X) = P(y = Tail | X) = 0.5

where X is an input matrix that contains all trials/observations and their features. Since this
‘flipping a coin’ experiment does not include any independent variable (feature), our input matrix
includes only the trials we made; that is, it will be a plain vector whose entries simply symbolize
the trials rather than concrete inputs.
But if we replace the experiment with a ‘Credit Scoring’ one, our outcome universe will still be
discrete and binary (‘Default’ and ‘Not’); however, the input turns back into a matrix, since
features now exist, as shown above!
Another radical change awaiting us after shifting the experiment is the ‘uncertainty’ affecting
the fairness that we assume for coins. Like unfair coins, credits carry different chances of
defaulting due to the different characteristics of the obligors. So, our ‘posteriors’ will not be
‘equally likely’ anymore.
C. Decision/Activation Function
The unfairness described in the previous part brings the problem of uncertainty into the process, and
with it the necessity of anticipation. As a ‘Supervised-Classification’ method, Logistic Regression helps
us converge to those ‘uncertain’ posteriors with a differentiable[2] ‘decision function’, drawn in
Figure-2 below.
Figure-2: Logistic Sigmoid Activation Function
This function is called the ‘logistic function’ or ‘sigmoid function’ and helps us squash real-valued
continuous inputs into the range (0, 1), which is gloriously useful while dealing with
probabilities! With the help of the ‘logistic function’, we can write our posterior as

P(y = 1 | x, w) = 1 / (1 + e^(-z))

where z is a function consisting of our features (x_1, ..., x_d) and their corresponding
weights/coefficients (w_1, ..., w_d) in the linear form shown below.
[2] Since the curve it creates is continuous. A ‘Signum Function’, sign(x), is not, since it is discrete.
z = w_0 + w_1 x_1 + ... + w_d x_d + ε

where w = (w_0, w_1, ..., w_d) and ε represents the ‘random error process (noise)’[3] inevitably
occurring in the data generating process.
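The two pieces just described can be sketched in a few lines of Python: a linear score built from features and weights, and the logistic (sigmoid) function that squashes it into (0, 1). The weights, intercept, and feature values below are made-up numbers for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights, intercept, and a single feature vector.
w = np.array([0.8, -1.2, 0.3])   # coefficients for three features
w0 = 0.5                         # intercept term
x = np.array([1.5, 0.7, 2.0])    # one observation

z = w0 + w @ x                   # linear score w0 + w1*x1 + ... + wd*xd
p = sigmoid(z)                   # posterior probability P(y = 1 | x)

print(z, p)  # the score is unbounded; p always lies strictly between 0 and 1
```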
By using the posterior equation above, we can rewrite the estimation function in the form of a
‘posterior probability’ ratio as shown below:

log( P(y = 1 | x) / P(y = 0 | x) ) = z

which is famously known as the ‘log of odds’[4] ratio. One can realize its usefulness while trying
to interpret the coefficients of the linear regression function[5].
Using a ‘logarithmic’ transformation helps our learning mechanism in 3 main aspects:
1. It makes the values more ‘normalized’ (big values become smaller and vice versa).
Normalization (scaling) helps us reach coefficients that are more consistent with respect to
magnitudes, so that none of them affects the outcome in a dominant way!
2. It makes the operations inside it easier to perform (multiplications → summations,
divisions → subtractions, exponents → multiplications).
3. It creates a curve/hyperplane (value sequence) which has ‘monotonicity’. Functions
which are increasing or decreasing monotonically:
3.1. can be traversed by an ‘optimization solver’[6] more efficiently with respect to time,
since they do not contain ‘local minima/maxima’, and
3.2. can be a representative of the original (unscaled) function, since the optimal
solution for the logarithmic function will be identical to the optimal solution for
the original function.
Curves of ‘logarithmic functions’ with various bases can be found in Figure-3 below. As can be
seen, all of them are ‘monotonic’ and cross the x-axis at the same point (x = 1). In the Logistic
Regression case, we invariably use the natural logarithm (base e) for our logarithmic function.
[3] Which causes the ‘bias’ in the fitting process. Noise is a natural part of the data generating process, so
even if we use the whole dataset we have (using the average hypothesis g_bar(x)), there is always an
‘approximation’ limit to the unknown target function.
[4] Values produced by this ratio will be used to build ‘score bands’, which is the final part of the ‘Credit
Scoring Model’ building process.
[5] Looking for the answer to the question: “How much does one unit of change in a feature affect the target?”
[6] e.g. Stochastic Gradient Descent or Quadratic Programming.
Figure-3: Logarithmic Functions with various bases
Passing through x = 1 (where log(1) = 0) helps us make more logical transformations when
interpreting the ‘Event’ to ‘No-Event’ (log odds) ratio.
So, whenever P(Event) > P(No-Event), we stay on the positive side of the function;
otherwise we pass to the negative side. This makes a lot of sense while labeling observations in
the outcome space, as the short check below illustrates.
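As a small numerical check of this sign rule: when the estimated probability exceeds 0.5, the log-odds is positive and the observation gets the ‘Event’ label, otherwise ‘No-Event’. The probability values below are arbitrary examples, not figures from the text.

```python
import math

def log_odds(p):
    """Log of the odds ratio p / (1 - p)."""
    return math.log(p / (1.0 - p))

for p in (0.9, 0.5, 0.2):                 # arbitrary example probabilities
    label = 1 if log_odds(p) > 0 else 0
    print(p, round(log_odds(p), 3), label)
# p > 0.5  -> positive log-odds -> label 1 ('Event')
# p = 0.5  -> log-odds = 0      -> the decision boundary
# p < 0.5  -> negative log-odds -> label 0 ('No-Event')
```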
D. Objective Function
Like other Machine Learning classifiers[7], Logistic Regression has an ‘objective function’
which tries to maximize the ‘likelihood function’ of the experiment[8]. This approach is known as
‘Maximum Likelihood Estimation (MLE)’ and can be written mathematically as follows:

w* = argmax_w ∏_(i=1..N) P(y_i | x_i, w)

where
● the output y_i ∈ {0, 1},
● P(y_i | x_i, w) is the posterior probability, which for y_i = 1 equals 1 / (1 + e^(-z_i)), and
[7] Naïve Bayes, Ensembled Trees, SVM, Neural Networks, etc.
[8] Or minimizing the negative of the ‘log likelihood function’, which would be a tricky move depending on the
optimization tool we have. If the aim is minimizing, the objective function can be called the ‘Loss/Cost
Function’.
● the parameter vector w is the vector of ‘weights/coefficients’ in z,
as we defined earlier. Before describing and optimizing this objective with respect to the parameter
w, it may be better to shift to the ‘coin’ experiment in order to simplify the remaining processes. So, the
‘objective function’ of the ‘flipping a coin’ problem can be written in the format below:

L(p) = ∏_(i=1..n) p^(y_i) (1 − p)^(1 − y_i)
where p is the likelihood of success (let it be Head), y_i is an independent Bernoulli random variable
(y_i ∈ {0, 1}), and the inner term is the ‘joint likelihood distribution’ function of the experiment. We
want to find the optimum value of p in order to maximize this function. But how did we decide
that maximizing it corresponds to our main goal, which is getting high classification accuracy?
The same procedure performed in ‘Linear Regression’ is definitely clear, since choosing LSE as an
objective function surely and obviously brings the shortest distances between the predictions (ŷ)
and the actual targets (y).
To obtain the same clarity for Logistic Regression’s MLE case, we need to approach it in a
numerical manner. To do that, let’s assign miscellaneous values to the likelihoods in the objective
function of the coin experiment. Those likelihoods may or may not exhibit discordance with the known
target values y.
Example | Assigned Values | Discordance | Likelihood Function
1 | | Yes |
2 | | No |
3 | | No |
4 | | Yes |
We designed 4 examples, 2 of which have no discordance between the assigned values and the targets;
that is, we made a successful classification. So, when we compare what the likelihood function returns,
assignments that are suitable with respect to the target produce higher likelihoods! This is why we choose to
maximize the ‘log likelihood function’ as the objective in the Logistic Regression case above. More
formally, we can summarize this logic in the two steps given below, followed by a small numerical check.
❖ for samples labeled as ‘1’, we desire to estimate P(y = 1 | x) as close to 1 as possible
❖ for samples labeled as ‘0’, we desire to estimate P(y = 0 | x), that is 1 − P(y = 1 | x), as close to 1 as possible
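The point made by the table can be reproduced with a short numerical check: assigning probabilities that agree with the observed labels yields a higher log-likelihood than assigning discordant ones. The probability values below are arbitrary stand-ins for the ones originally shown in the table.

```python
import math

def bernoulli_log_likelihood(y, p):
    """Joint log-likelihood of binary labels y under estimated probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 0]                       # observed labels
concordant = [0.9, 0.1, 0.8, 0.2]      # estimates that agree with the labels
discordant = [0.1, 0.9, 0.2, 0.8]      # estimates that disagree with the labels

print(bernoulli_log_likelihood(y, concordant))   # around -0.66 (high likelihood)
print(bernoulli_log_likelihood(y, discordant))   # around -7.82 (low likelihood)
```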
E. Optimizing Objectives
E.1. Getting the Gradient Equation (Differentiation)
E.1.1. Coin Experiment, ‘Average Learning’
We can join the likelihood functions:
● p^k for Heads, where k is the number of successful (Head) occurrences, and
● (1 − p)^(n − k) for Tails, where n − k is the number of fail (Tail) occurrences
of the coin experiment as follows:

L(p) = p^k (1 − p)^(n − k)

Taking the logarithm of the joint likelihood function, we get the log-likelihood:

ℓ(p) = k log(p) + (n − k) log(1 − p)

which can be written in summation form:

ℓ(p) = Σ_(i=1..n) [ y_i log(p) + (1 − y_i) log(1 − p) ]
Taking the derivative of this function with respect to p and setting it equal to zero will bring us the
optimal value of p which maximizes the log-likelihood:

dℓ/dp = Σ_(i=1..n) [ y_i / p − (1 − y_i) / (1 − p) ] = 0

Distribute the summation operation:

k / p − (n − k) / (1 − p) = 0
Multiplying through by p(1 − p), we get:

k (1 − p) − (n − k) p = 0

Upon distributing, we see that two of the resulting terms cancel each other out:

k − kp − np + kp = 0

Leaving us with:

k − np = 0

Solve for p:

p = k / n
Thus we have shown that the ‘average’ brings the best estimation performance (maximum
likelihood) in the case where no inputs (explanatory variables) exist.
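A quick numerical confirmation of this result, under an assumed toy sequence of coin flips: scanning candidate values of p shows that the Bernoulli log-likelihood peaks at the sample average (number of heads divided by the number of trials).

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0]           # assumed toy data: 1 = Head, 0 = Tail
k, n = sum(flips), len(flips)              # k heads out of n trials

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

# Evaluate the log-likelihood on a fine grid of candidate p values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=log_likelihood)

print(best_p, k / n)   # both are 0.625: the sample average maximizes the likelihood
```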
E.1.2. Credit Scoring Experiment, ‘Stochastic Learning’
How about taking the x’s into account, like in a real Machine Learning experiment? Calculating the
‘Gradients/Slopes’ over every observation in the input set takes a long time and is generally an
‘out-of-memory’ type of operation when the input set is substantially large. Using a ‘stochastic’
process brings a remedy for that inefficiency by randomly selecting feature vectors from the data
set and calculating the gradient/slope only for those. This process keeps selecting random data points
and comparing their slopes until convergence with respect to the slope occurs. This procedure is
famously known as ‘Stochastic Gradient Descent/Ascent (SGD/SGA)’ optimization. A
symbolic trajectory of an SGD algorithm is illustrated in Figure-4 below.
Figure-4: Gradient Descent Illustration
To apply SGD/SGA, we need to shift back to the ‘Credit Scoring’ experiment, which has the
objective (joint likelihood) function given below, and to derive its Gradient Equation first.
The operations below are the sequential steps of finding the ‘Gradient Equation’ of the objective function
of the ‘Credit Scoring’ experiment.
Step-1: Joint log-likelihood function
Step-2: Taking the log converts the product into a summation
Step-3: Substituting the posterior with its sigmoid form
Step-4: Combining with respect to y_i
Step-5: Merging the two ‘log’ terms in the square bracket
Step-6: Cancelling the logarithm and exponential functions
Step-7: Final equation from which we will take partial derivatives
Step-8: Partial derivatives
➔ Partial-1:
➔ Partial-2:
➔ Combine partials:
➔ Replace the exponent term with the corresponding conditional probability (posterior) term, and
➔ take x_i as a common factor,
which gives the final form of the log-likelihood gradient (Gradient Equation):

∇_w ℓ(w) = Σ_(i=1..N) ( y_i − P(y_i = 1 | x_i, w) ) x_i
In summary, our goal is to find the optimum value of the parameter vector w which maximizes the
log-likelihood function of the ‘Credit Scoring’ experiment. The steps above provide us with the
differentiated version of the log-likelihood, and it is expected to reach a local
maximum/minimum at the point where the ‘Gradient/Slope’ is zero.
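The final Gradient Equation is simple to evaluate: sum, over observations, the feature vector weighted by the difference between the observed label and the estimated posterior. A minimal sketch is below; the toy data and the zero initial weights are placeholders, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_gradient(w, X, y):
    """Gradient of the log-likelihood: sum over i of (y_i - p_i) * x_i."""
    p = sigmoid(X @ w)          # estimated posteriors P(y = 1 | x_i, w)
    return X.T @ (y - p)        # a vector of length d

# Toy data: 4 observations, 2 features (values are arbitrary).
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.1], [-1.5, 1.2]])
y = np.array([1, 0, 1, 0])
w = np.zeros(2)

print(log_likelihood_gradient(w, X, y))
```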
So, in a regular optimization procedure, algorithms[9] will try to calculate the gradient vector for each data
point, which is a feature vector of dimension d. Since calculating all of these coefficients is
computationally inefficient, SGD/SGA comes to the stage and takes our hand while we walk on
the objective/loss curve[10] to the bottom or the peak.
E.2. Finding Maxima-Minima
Since the final ‘Gradient Equation’ is in ‘transcendental’ form, that is, it contains non-algebraic
functions such as logarithms and exponentials, it cannot be solved directly (no closed-form solution
exists). So we need approximation techniques, for example:
❖ Gradient Descent
❖ Newton-Raphson Method
In Machine Learning, we generally use the Gradient Descent technique while trying to approximate
the global maxima or minima of objective functions, because it has concrete advantages over
Newton-Raphson.
[9] For example, a ‘Gradient Descent’ algorithm which is not working ‘stochastically’ but sequentially trying
to compute all slopes on the objective/loss function curve/hyperplane.
[10] Or, more generally, a hyperplane like the one in Figure-4.
Newton-Raphson is a root-finding algorithm[11] that maximizes a function using the
knowledge of its second derivative (the Hessian Matrix)[12]. That can be faster when the second
derivative is known and easy to compute (as in Logistic Regression). However, the analytical
expression for the second derivative is often complicated or intractable, requiring a lot of
computation.
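For completeness, here is a hedged sketch of how a Newton-Raphson update could look for the logistic log-likelihood; it combines the gradient from section E.1.2 with the Hessian. The data (an intercept column plus one feature), the labels, and the iteration count are all made up for illustration, and this is only one way to arrange the computation, not the author's prescribed procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(w, X, y):
    """One Newton-Raphson update for maximizing the logistic log-likelihood."""
    p = sigmoid(X @ w)
    gradient = X.T @ (y - p)            # first derivative of the log-likelihood
    S = np.diag(p * (1.0 - p))
    hessian = -X.T @ S @ X              # second derivative (Hessian matrix)
    return w - np.linalg.solve(hessian, gradient)

# Made-up toy data: first column is an intercept, second is a single feature.
X = np.array([[1, 0.2], [1, 0.8], [1, 1.5], [1, 2.5], [1, 3.0], [1, 3.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
for _ in range(8):                      # a handful of iterations is typically enough
    w = newton_step(w, X, y)
print(w, sigmoid(X @ w))                # fitted weights and fitted posteriors
```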
On the other hand, Gradient Descent maximizes/minimizes a function using knowledge of its
first derivative only. It simply follows the steepest descent from the current point towards the desired
hill or hole. This is like rolling a ball down the graph of the loss function (as in Figure-4) until it
comes to rest. Since Gradient Descent uses first derivatives, it is configured to find local
maxima/minima, but we need the global ones. To handle this problem, we use a ‘stochastic’
approach which calculates gradients at randomly chosen points of the loss curve/hyperplane
and compares the local minima/maxima with each other to get the global ones. This procedure is
illustrated in Figure-5 below.
Figure-5: Stochastic Gradient Descent Illustration
A standard Gradient Descent algorithm is defined as follows, where η is the ‘learning rate’ and
∇F(w) symbolizes the Gradient Equation:

w_(t+1) = w_t − η ∇F(w_t)
[11] It is called a ‘root-finding method’ because it tries to find a point x satisfying f'(x) = 0 by approximating
f' with a linear function g and then solving for the root of that function explicitly. The root of g is not
necessarily the root of f', but it is under many circumstances a good guess.
[12] It helps us know the ‘concavity’ of the objective function surface.
In the algorithm above, one takes steps proportional (via the learning rate) to the negative of the gradient
(or approximate gradient) of the function at the current point. If, instead, one takes steps
proportional to the positive of the gradient, one approaches a local maximum of that function.
This procedure is then known as Gradient Ascent.
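Since the article frames the objective as a log-likelihood to be maximized, here is a minimal sketch of stochastic gradient ascent applied to it. The learning rate, epoch count, and toy data are arbitrary choices for illustration, not settings recommended by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, learning_rate=0.1, epochs=200, seed=0):
    """Maximize the log-likelihood by stepping along the gradient of one
    randomly chosen observation at a time (SGA)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):                   # visit points in random order
            p_i = sigmoid(X[i] @ w)                    # posterior for this data point
            w += learning_rate * (y[i] - p_i) * X[i]   # single-point gradient step
    return w

# Made-up toy data: an intercept column plus one feature.
X = np.array([[1, 0.2], [1, 0.8], [1, 1.5], [1, 2.5], [1, 3.0], [1, 3.5]])
y = np.array([0, 0, 1, 0, 1, 1])

w = stochastic_gradient_ascent(X, y)
print(w)
print(sigmoid(X @ w))   # estimated posteriors for the training points
```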
In order to gain more technical knowledge about Gradient Descent algorithms, it would be best
to read Avinash Kadimisetty’s post on Towards Data Science and to watch the video from the
3BLUE1BROWNSERIES channel on YouTube, which are definitely the most informative resources
in my opinion! In addition, if you desire to get informed about the Newton-Raphson Method and
its applications in Logistic Regression, please watch the meticulously prepared video from
CodeEmporium’s channel on YouTube. All of these resources are listed in the reference list at the
end of the article.
F. Further Readings
1. Sigmoid vs. ReLU: Using ‘Sigmoid’ as an activation function brings some
disadvantages while training the model. For example, its first derivative is not monotonic,
as shown below.
Figure-6: Derivative of Sigmoid Function
Figure-7: Sigmoid and ReLu Comparison
ReLU is the most popular activation function for NN-type classifiers nowadays. Detailed
explanations can be found here. A small numerical sketch of this point follows after this list.
2. Logistic Regression vs. Naïve Bayes: This is actually about understanding the differences
between ‘Discriminative’ and ‘Generative’ models. A brief but elegant post exists here.
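As a quick numerical aside on the first item: the derivative of the sigmoid can be written as σ(z)(1 − σ(z)), which peaks at z = 0 and decays towards zero on both sides (hence non-monotonic and prone to vanishing gradients), whereas ReLU does not saturate for positive inputs. The sample z values below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0, decays towards 0

def relu(z):
    return np.maximum(0.0, z)     # identity for positive z, zero otherwise

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid_derivative(z))      # tiny at the tails: gradients vanish
print(relu(z))                    # unbounded for positive inputs
```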
G. References
➔ https://www.cs.cmu.edu/~mgormley/courses/10701-f16/schedule.html (Lecture-5)
➔ https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/pdfs/40%20LogisticRegression.pdf
➔ http://kronosapiens.github.io/blog/2017/03/28/objective-functions-in-machine-learning.html
➔ https://towardsdatascience.com/gradient-descent-demystified-bc30b26e432a
➔ https://youtu.be/YMJtsYIp4kg
➔ https://newonlinecourses.science.psu.edu/stat414/node/191/
➔ https://datascience.stackexchange.com/questions/25444/advantages-of-monotonic-activation-functions-over-non-monotonic-functions-in-neu
➔ https://stats.stackexchange.com/questions/253632/why-is-newtons-method-not-widely-used-in-machine-learning
➔ https://stackoverflow.com/questions/12066761/what-is-the-difference-between-gradient-descent-and-newtons-gradient-descent
➔ https://en.wikipedia.org/wiki/Linear_classifier
➔ https://en.wikipedia.org/wiki/Logistic_regression
➔ https://www.youtube.com/watch?v=IHZwWFHWa-w&t=463s
➔ https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
➔ https://sebastianraschka.com/faq/docs/naive-bayes-vs-logistic-regression.html
