LOGISTIC REGRESSION CLASSIFIER
How It Works: A Step-by-Step Conceptual Guide
Caglar Subasi, 2019/02
Logistic Regression is a ‘Statistical Learning’ technique categorized among ‘Supervised’ Machine
Learning (ML) methods dedicated to ‘Classification’ tasks. It has gained a tremendous
reputation over the last two decades, especially in the financial sector, due to its prominent ability
to detect defaulters. A contradiction seems to appear when we declare that a classifier whose name
contains the term ‘Regression’ is being used for classification, but this is exactly what makes Logistic
Regression magical: it uses a linear regression equation to produce discrete binary outputs. And yes, it is
also categorized in the ‘Discriminative Models’ subgroup[1] of ML methods, like Support Vector
Machines and the Perceptron, which all use linear equations as a building block and attempt to
maximize the quality of the output on a training set.
[1] The complementary subgroup, called ‘Generative Models’, has members like ‘Naïve Bayes’ and ‘Fisher’s
Linear Discriminants’.
Figure-1: From Decision Function to Decision Boundary
We will follow the guide below throughout the article in the given order. As can be understood
from the contents, this article is just a conceptual manual intending to clarify the technical workflow of
the Logistic Regression Classifier. After a long period of searching and reading, I realized that there is
an abundance of empirical studies but a striking scarcity of material on the theoretical aspects of Machine
Learning implementations. This is maybe why Yuval Noah Harari states in his famous ‘best
seller’ book ‘Homo Deus: A Brief History of Tomorrow’ that:
“...In fact modernity is a surprisingly simple deal. The entire
contract can be summarised in a single phrase: humans agree to
give up ‘meaning’ in exchange for ‘power’...”
With this article, my aim is to create a complete guide which conveys the inner meaning of each
step in the Logistic Regression workflow. Hope you enjoy...
Table of Contents
A. Data Structure
B. Experiment Design
C. Decision/Activation Function
D. Objective Function
E. Optimizing Objectives
E.1. Getting the Gradient Equation (Differentiation)
E.1.1. Coin Experiment, ‘Average Learning’
E.1.2. Credit Scoring Experiment, ‘Stochastic Learning’
E.2. Finding Maxima-Minima
F. Further Readings
G. References
A. Data Structure
Inputs are continuous feature-vectors (x_i’s) of length d, where x_i ∈ R^d and i = 1, ..., N.
So, the input matrix is X ∈ R^(N×d), which contains N inputs (data points), each containing d
features. Inputs can be illustrated as a matrix, as sketched below.
And the output is a discrete, binary variable, such that y_i ∈ {0, 1}.
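As a concrete illustration, here is a minimal NumPy sketch of this data structure; the sizes N and d and the label values are arbitrary placeholders, not values taken from the text.

```python
import numpy as np

# Hypothetical sizes: N data points, each with d continuous features.
N, d = 5, 3

# Input matrix X has shape (N, d): one row per observation, one column per feature.
X = np.random.default_rng(0).normal(size=(N, d))

# Output y is discrete and binary: each label is either 0 or 1.
y = np.array([1, 0, 1, 1, 0])

print(X.shape)            # (5, 3)
print(set(y.tolist()))    # {0, 1}
```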
B. Experiment Design
Let’s say we have a ‘flipping/tossing a coin’ experiment. Supposing the coin is a fair one brings us
‘equally likely’ outcomes of ‘Head’ and ‘Tail’. That is, the ‘posterior’ probabilities are

P(y = Head | X) = P(y = Tail | X) = 0.5

where X is an input matrix that contains all trials/observations and their features. Since this
‘flipping a coin’ experiment does not include any independent variable (feature), our input matrix
includes only the trials we made; that is, it will be a plain vector whose entries simply symbolize
the trials rather than concrete inputs.
But if we replace the experiment with a ‘Credit Scoring’ one, our outcome universe will still be
discrete and binary (‘Default’ and ‘Not’); however, the input turns back into a matrix, since
features now exist, as shown above!
Another radical change awaiting us after shifting the experiment is the ‘uncertainty’ affecting
the fairness that we assume for coins. Like unfair coins, credits carry different chances of
defaulting due to the different characteristics of the obligors. So, our ‘posteriors’ will not be
‘equally likely’ anymore.
C. Decision/Activation Function
The unfairness described in the previous part brings the problem of uncertainty into the process, and
with it the necessity of anticipation. As a ‘Supervised-Classification’ method, Logistic Regression helps
us converge to those ‘uncertain’ posteriors with a differentiable[2] ‘decision function’, drawn in
Figure-2 below.
Figure-2: Logistic Sigmoid Activation Function
This function is called the ‘logistic function’ or ‘sigmoid function’ and helps us squash real-valued
continuous inputs into the range (0, 1), which is gloriously useful while dealing with
probabilities! With the help of the ‘logistic function’, we can write our posterior as

P(y = 1 | x, w) = 1 / (1 + e^(-z))

where z is a function consisting of our features (x_1, ..., x_d) and their corresponding
weights/coefficients (w_1, ..., w_d) in the linear form shown below.
[2] Since the curve it creates is continuous. A ‘Signum Function’, sign(x), is not, since it is discrete.
z = w_0 + w_1 x_1 + ... + w_d x_d + ε

where w = (w_0, w_1, ..., w_d) and ε represents the ‘random error process (noise)’[3] inevitably
occurring in the data generating process.
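The two pieces just described can be sketched in a few lines of Python: a linear score built from features and weights, and the logistic (sigmoid) function that squashes it into (0, 1). The weights, intercept, and feature values below are made-up numbers for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights, intercept, and a single feature vector.
w = np.array([0.8, -1.2, 0.3])   # coefficients for three features
w0 = 0.5                         # intercept term
x = np.array([1.5, 0.7, 2.0])    # one observation

z = w0 + w @ x                   # linear score w0 + w1*x1 + ... + wd*xd
p = sigmoid(z)                   # posterior probability P(y = 1 | x)

print(z, p)  # the score is unbounded; p always lies strictly between 0 and 1
```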
By using the posterior equation above, we can rewrite the estimation function in the form of a
‘posterior probability’ ratio as shown below:

log( P(y = 1 | x) / P(y = 0 | x) ) = z

which is famously known as the ‘log of odds’[4] ratio. One can realize its usefulness while trying
to interpret the coefficients of the linear regression function[5].
Using a ‘logarithmic’ transformation helps our learning mechanism in 3 main aspects:
1. It makes the values more ‘normalized’ (big values become smaller and vice versa).
Normalization (scaling) helps us reach coefficients that are more consistent with respect to
magnitudes, so that none of them affects the outcome in a dominant way!
2. It makes the operations inside it easier to perform (multiplications → summations,
divisions → subtractions, exponents → multiplications).
3. It creates a curve/hyperplane (value sequence) which has ‘monotonicity’. Functions
which are increasing or decreasing monotonically:
3.1. can be traversed by an ‘optimization solver’[6] more efficiently with respect to time,
since they do not contain ‘local minima/maxima’, and
3.2. can be a representative of the original (unscaled) function, since the optimal
solution for the logarithmic function will be identical to the optimal solution for
the original function.
Curves of ‘logarithmic functions’ with various bases can be found in Figure-3 below. As can be
seen, all of them are ‘monotonic’ and cross the x-axis at the same point (x = 1). In the Logistic
Regression case, we invariably use the natural logarithm (base e) for our logarithmic function.
[3] Which causes the ‘bias’ in the fitting process. Noise is a natural part of the data generating process, so
even if we use the whole dataset we have (using the average hypothesis g_bar(x)), there is always an
‘approximation’ limit to the unknown target function.
[4] Values produced by this ratio will be used to build ‘score bands’, which is the final part of the ‘Credit
Scoring Model’ building process.
[5] Looking for the answer to the question: “How much does one unit of change in a feature affect the target?”
[6] e.g. Stochastic Gradient Descent or Quadratic Programming.
Figure-3: Logarithmic Functions with various bases
Passing through x = 1 (where log(1) = 0) helps us make more logical transformations when
interpreting the ‘Event’ to ‘No-Event’ (log odds) ratio.
So, whenever P(Event) > P(No-Event), we stay on the positive side of the function;
otherwise we pass to the negative side. This makes a lot of sense while labeling observations in
the outcome space, as the short check below illustrates.
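As a small numerical check of this sign rule: when the estimated probability exceeds 0.5, the log-odds is positive and the observation gets the ‘Event’ label, otherwise ‘No-Event’. The probability values below are arbitrary examples, not figures from the text.

```python
import math

def log_odds(p):
    """Log of the odds ratio p / (1 - p)."""
    return math.log(p / (1.0 - p))

for p in (0.9, 0.5, 0.2):                 # arbitrary example probabilities
    label = 1 if log_odds(p) > 0 else 0
    print(p, round(log_odds(p), 3), label)
# p > 0.5  -> positive log-odds -> label 1 ('Event')
# p = 0.5  -> log-odds = 0      -> the decision boundary
# p < 0.5  -> negative log-odds -> label 0 ('No-Event')
```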
D. Objective Function
Like other Machine Learning classifiers[7], Logistic Regression has an ‘objective function’
which tries to maximize the ‘likelihood function’ of the experiment[8]. This approach is known as
‘Maximum Likelihood Estimation (MLE)’ and can be written mathematically as follows:

w* = argmax_w ∏_(i=1..N) P(y_i | x_i, w)

where
● the output y_i ∈ {0, 1},
● P(y_i | x_i, w) is the posterior probability, which for y_i = 1 equals 1 / (1 + e^(-z_i)), and
[7] Naïve Bayes, Ensembled Trees, SVM, Neural Networks, etc.
[8] Or minimizing the negative of the ‘log likelihood function’, which would be a tricky move depending on the
optimization tool we have. If the aim is minimizing, the objective function can be called the ‘Loss/Cost
Function’.
● the parameter vector w is the vector of ‘weights/coefficients’ in z,
as we defined earlier. Before describing and optimizing this objective with respect to the parameter
w, it may be better to shift to the ‘coin’ experiment in order to simplify the remaining processes. So, the
‘objective function’ of the ‘flipping a coin’ problem can be written in the format below:

L(p) = ∏_(i=1..n) p^(y_i) (1 − p)^(1 − y_i)
where p is the likelihood of success (let it be Head), y_i is an independent Bernoulli random variable
(y_i ∈ {0, 1}), and the inner term is the ‘joint likelihood distribution’ function of the experiment. We
want to find the optimum value of p in order to maximize this function. But how did we decide
that maximizing it corresponds to our main goal, which is getting high classification accuracy?
The same procedure performed in ‘Linear Regression’ is definitely clear, since choosing LSE as an
objective function surely and obviously brings the shortest distances between the predictions (ŷ)
and the actual targets (y).
To obtain the same clarity for Logistic Regression’s MLE case, we need to approach it in a
numerical manner. To do that, let’s assign miscellaneous values to the likelihoods in the objective
function of the coin experiment. Those likelihoods may or may not exhibit discordance with the known
target values y.
Example | Assigned Values | Discordance | Likelihood Function
1 | | Yes |
2 | | No |
3 | | No |
4 | | Yes |
We designed 4 examples, 2 of which have no discordance between the assigned values and the targets;
that is, we made a successful classification. So, when we compare what the likelihood function returns,
assignments that are suitable with respect to the target produce higher likelihoods! This is why we choose to
maximize the ‘log likelihood function’ as the objective in the Logistic Regression case above. More
formally, we can summarize this logic in the two steps given below, followed by a small numerical check.
❖ for samples labeled as ‘1’, we desire to estimate P(y = 1 | x) as close to 1 as possible
❖ for samples labeled as ‘0’, we desire to estimate P(y = 0 | x), that is 1 − P(y = 1 | x), as close to 1 as possible
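The point made by the table can be reproduced with a short numerical check: assigning probabilities that agree with the observed labels yields a higher log-likelihood than assigning discordant ones. The probability values below are arbitrary stand-ins for the ones originally shown in the table.

```python
import math

def bernoulli_log_likelihood(y, p):
    """Joint log-likelihood of binary labels y under estimated probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 0]                       # observed labels
concordant = [0.9, 0.1, 0.8, 0.2]      # estimates that agree with the labels
discordant = [0.1, 0.9, 0.2, 0.8]      # estimates that disagree with the labels

print(bernoulli_log_likelihood(y, concordant))   # around -0.66 (high likelihood)
print(bernoulli_log_likelihood(y, discordant))   # around -7.82 (low likelihood)
```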
E. Optimizing Objectives
E.1. Getting the Gradient Equation (Differentiation)
E.1.1. Coin Experiment, ‘Average Learning’
We can join the likelihood functions:
● p^k for Heads, where k is the number of successful (Head) occurrences, and
● (1 − p)^(n − k) for Tails, where n − k is the number of fail (Tail) occurrences
of the coin experiment as follows:

L(p) = p^k (1 − p)^(n − k)

Taking the logarithm of the joint likelihood function, we get the log-likelihood:

ℓ(p) = k log(p) + (n − k) log(1 − p)

which can be written in summation form:

ℓ(p) = Σ_(i=1..n) [ y_i log(p) + (1 − y_i) log(1 − p) ]
Taking the derivative of this function with respect to p and setting it equal to zero will bring us the
optimal value of p which maximizes the log-likelihood:

dℓ/dp = Σ_(i=1..n) [ y_i / p − (1 − y_i) / (1 − p) ] = 0

Distribute the summation operation:

k / p − (n − k) / (1 − p) = 0
Multiplying through by p(1 − p), we get:

k (1 − p) − (n − k) p = 0

Upon distributing, we see that two of the resulting terms cancel each other out:

k − kp − np + kp = 0

Leaving us with:

k − np = 0

Solve for p:

p = k / n
Thus we have shown that the ‘average’ brings the best estimation performance (maximum
likelihood) in the case where no inputs (explanatory variables) exist.
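A quick numerical confirmation of this result, under an assumed toy sequence of coin flips: scanning candidate values of p shows that the Bernoulli log-likelihood peaks at the sample average (number of heads divided by the number of trials).

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0]           # assumed toy data: 1 = Head, 0 = Tail
k, n = sum(flips), len(flips)              # k heads out of n trials

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

# Evaluate the log-likelihood on a fine grid of candidate p values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=log_likelihood)

print(best_p, k / n)   # both are 0.625: the sample average maximizes the likelihood
```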
E.1.2. Credit Scoring Experiment, ‘Stochastic Learning’
How about taking the x’s into account, like in a real Machine Learning experiment? Calculating the
‘Gradients/Slopes’ over every observation in the input set takes a long time and is generally an
‘out-of-memory’ type of operation when the input set is substantially large. Using a ‘stochastic’
process brings a remedy for that inefficiency by randomly selecting feature vectors from the data
set and calculating the gradient/slope only for those. This process keeps selecting random data points
and comparing their slopes until convergence with respect to the slope occurs. This procedure is
famously known as ‘Stochastic Gradient Descent/Ascent (SGD/SGA)’ optimization. A
symbolic trajectory of an SGD algorithm is illustrated in Figure-4 below.
Figure-4: Gradient Descent Illustration
To apply SGD/SGA, we need to shift back to the ‘Credit Scoring’ experiment, which has the
objective (joint likelihood) function given below, and to derive its Gradient Equation first.
The operations below are the sequential steps of finding the ‘Gradient Equation’ of the objective function
of the ‘Credit Scoring’ experiment.
Step-1: Joint log-likelihood function
Step-2: Taking the log converts the product into a summation
Step-3: Substituting the posterior with its sigmoid form
Step-4: Combining with respect to y_i
Step-5: Merging the two ‘log’ terms in the square bracket
Step-6: Cancelling the logarithm and exponential functions
Step-7: Final equation from which we will take partial derivatives
Step-8: Partial derivatives
➔ Partial-1:
➔ Partial-2:
➔ Combine partials:
➔ Replace the exponent term with the corresponding conditional probability (posterior) term, and
➔ take x_i as a common factor,
which gives the final form of the log-likelihood gradient (Gradient Equation):

∇_w ℓ(w) = Σ_(i=1..N) ( y_i − P(y_i = 1 | x_i, w) ) x_i
In summary, our goal is to find the optimum value of the parameter vector w which maximizes the
log-likelihood function of the ‘Credit Scoring’ experiment. The steps above provide us with the
differentiated version of the log-likelihood, and it is expected to reach a local
maximum/minimum at the point where the ‘Gradient/Slope’ is zero.
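The final Gradient Equation is simple to evaluate: sum, over observations, the feature vector weighted by the difference between the observed label and the estimated posterior. A minimal sketch is below; the toy data and the zero initial weights are placeholders, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_gradient(w, X, y):
    """Gradient of the log-likelihood: sum over i of (y_i - p_i) * x_i."""
    p = sigmoid(X @ w)          # estimated posteriors P(y = 1 | x_i, w)
    return X.T @ (y - p)        # a vector of length d

# Toy data: 4 observations, 2 features (values are arbitrary).
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.1], [-1.5, 1.2]])
y = np.array([1, 0, 1, 0])
w = np.zeros(2)

print(log_likelihood_gradient(w, X, y))
```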
So, in a regular optimization procedure, algorithms[9] will try to calculate the gradient vector for each data
point, which is a feature vector of dimension d. Since calculating all of these coefficients is
computationally inefficient, SGD/SGA comes to the stage and takes our hand while we walk on
the objective/loss curve[10] to the bottom or the peak.
E.2. Finding Maxima-Minima
Since the final ‘Gradient Equation’ is in ‘transcendental’ form, that is, it contains non-algebraic
functions such as logarithms and exponentials, it cannot be solved directly (no closed-form solution
exists). So we need approximation techniques, for example:
❖ Gradient Descent
❖ Newton-Raphson Method
In Machine Learning, we generally use the Gradient Descent technique while trying to approximate
the global maxima or minima of objective functions, because it has concrete advantages over
Newton-Raphson.
[9] For example, a ‘Gradient Descent’ algorithm which is not working ‘stochastically’ but sequentially trying
to compute all slopes on the objective/loss function curve/hyperplane.
[10] Or, more generally, a hyperplane like the one in Figure-4.
Newton-Raphson is a root-finding algorithm[11] that maximizes a function using the
knowledge of its second derivative (the Hessian Matrix)[12]. That can be faster when the second
derivative is known and easy to compute (as in Logistic Regression). However, the analytical
expression for the second derivative is often complicated or intractable, requiring a lot of
computation.
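For completeness, here is a hedged sketch of how a Newton-Raphson update could look for the logistic log-likelihood; it combines the gradient from section E.1.2 with the Hessian. The data (an intercept column plus one feature), the labels, and the iteration count are all made up for illustration, and this is only one way to arrange the computation, not the author's prescribed procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(w, X, y):
    """One Newton-Raphson update for maximizing the logistic log-likelihood."""
    p = sigmoid(X @ w)
    gradient = X.T @ (y - p)            # first derivative of the log-likelihood
    S = np.diag(p * (1.0 - p))
    hessian = -X.T @ S @ X              # second derivative (Hessian matrix)
    return w - np.linalg.solve(hessian, gradient)

# Made-up toy data: first column is an intercept, second is a single feature.
X = np.array([[1, 0.2], [1, 0.8], [1, 1.5], [1, 2.5], [1, 3.0], [1, 3.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
for _ in range(8):                      # a handful of iterations is typically enough
    w = newton_step(w, X, y)
print(w, sigmoid(X @ w))                # fitted weights and fitted posteriors
```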
On the other hand, Gradient Descent maximizes/minimizes a function using knowledge of its
first derivative only. It simply follows the steepest descent from the current point towards the desired
hill or hole. This is like rolling a ball down the graph of the loss function (as in Figure-4) until it
comes to rest. Since Gradient Descent uses first derivatives, it is configured to find local
maxima/minima, but we need the global ones. To handle this problem, we use a ‘stochastic’
approach which calculates gradients at randomly chosen points of the loss curve/hyperplane
and compares the local minima/maxima with each other to get the global ones. This procedure is
illustrated in Figure-5 below.
Figure-5: Stochastic Gradient Descent Illustration
A standard Gradient Descent algorithm is defined as follows, where η is the ‘learning rate’ and
∇F(w) symbolizes the Gradient Equation:

w_(t+1) = w_t − η ∇F(w_t)
[11] It is called a ‘root-finding method’ because it tries to find a point x satisfying f'(x) = 0 by approximating
f' with a linear function g and then solving for the root of that function explicitly. The root of g is not
necessarily the root of f', but it is under many circumstances a good guess.
[12] It helps us know the ‘concavity’ of the objective function surface.
In the algorithm above, one takes steps proportional (via the learning rate) to the negative of the gradient
(or approximate gradient) of the function at the current point. If, instead, one takes steps
proportional to the positive of the gradient, one approaches a local maximum of that function.
This procedure is then known as Gradient Ascent.
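Since the article frames the objective as a log-likelihood to be maximized, here is a minimal sketch of stochastic gradient ascent applied to it. The learning rate, epoch count, and toy data are arbitrary choices for illustration, not settings recommended by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, learning_rate=0.1, epochs=200, seed=0):
    """Maximize the log-likelihood by stepping along the gradient of one
    randomly chosen observation at a time (SGA)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):                   # visit points in random order
            p_i = sigmoid(X[i] @ w)                    # posterior for this data point
            w += learning_rate * (y[i] - p_i) * X[i]   # single-point gradient step
    return w

# Made-up toy data: an intercept column plus one feature.
X = np.array([[1, 0.2], [1, 0.8], [1, 1.5], [1, 2.5], [1, 3.0], [1, 3.5]])
y = np.array([0, 0, 1, 0, 1, 1])

w = stochastic_gradient_ascent(X, y)
print(w)
print(sigmoid(X @ w))   # estimated posteriors for the training points
```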
In order to gain more technical knowledge about Gradient Descent algorithms, it would be best
to read Avinash Kadimisetty’s post on Towards Data Science and to watch the video from the
3BLUE1BROWNSERIES channel on YouTube, which are definitely the most informative resources
in my opinion! In addition, if you desire to get informed about the Newton-Raphson Method and
its applications in Logistic Regression, please watch the meticulously prepared video from
CodeEmporium’s channel on YouTube. All of these resources are listed in the reference list at the
end of the article.
F. Further Readings
1. Sigmoid vs. ReLU: Using ‘Sigmoid’ as an activation function brings some
disadvantages while training the model. For example, its first derivative is not monotonic,
as shown below.
Figure-6: Derivative of Sigmoid Function
Figure-7: Sigmoid and ReLu Comparison
ReLU is the most popular activation function for NN-type classifiers nowadays. Detailed
explanations can be found here. A small numerical sketch of this point follows after this list.
2. Logistic Regression vs. Naïve Bayes: This is actually about understanding the differences
between ‘Discriminative’ and ‘Generative’ models. A brief but elegant post exists here.
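As a quick numerical aside on the first item: the derivative of the sigmoid can be written as σ(z)(1 − σ(z)), which peaks at z = 0 and decays towards zero on both sides (hence non-monotonic and prone to vanishing gradients), whereas ReLU does not saturate for positive inputs. The sample z values below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0, decays towards 0

def relu(z):
    return np.maximum(0.0, z)     # identity for positive z, zero otherwise

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid_derivative(z))      # tiny at the tails: gradients vanish
print(relu(z))                    # unbounded for positive inputs
```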
G. References
➔ https://www.cs.cmu.edu/~mgormley/courses/10701-f16/schedule.html (Lecture-5)
➔ https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/pdfs/40%20LogisticRegression.pdf
➔ http://kronosapiens.github.io/blog/2017/03/28/objective-functions-in-machine-learning.html
➔ https://towardsdatascience.com/gradient-descent-demystified-bc30b26e432a
➔ https://youtu.be/YMJtsYIp4kg
➔ https://newonlinecourses.science.psu.edu/stat414/node/191/
➔ https://datascience.stackexchange.com/questions/25444/advantages-of-monotonic-activation-functions-over-non-monotonic-functions-in-neu
➔ https://stats.stackexchange.com/questions/253632/why-is-newtons-method-not-widely-used-in-machine-learning
➔ https://stackoverflow.com/questions/12066761/what-is-the-difference-between-gradient-descent-and-newtons-gradient-descent
➔ https://en.wikipedia.org/wiki/Linear_classifier
➔ https://en.wikipedia.org/wiki/Logistic_regression
➔ https://www.youtube.com/watch?v=IHZwWFHWa-w&t=463s
➔ https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
➔ https://sebastianraschka.com/faq/docs/naive-bayes-vs-logistic-regression.html
