Week2 - 2022 - Biological Data Science - Polikar - Traditional Machine Learning Lecture
Signal Processing & Pattern Recognition Laboratory @ Rowan University presents
These lecture notes were prepared by Robi Polikar. Unauthorized use, including duplication, even in part, is not allowed without explicit written permission. Such permission will be given – upon request – for noncommercial educational purposes if you agree to all of the following:
1. Restrict the usage of this material for noncommercial and nonprofit educational purposes only; AND
2. The entire presentation is kept together as a whole, including this page and this entire notice; AND
3. You include the following link/reference on your site: © Robi Polikar http://users.rowan.edu/~polikar.
Getting Ready
If you do not already have Matlab, and have not already obtained a 30-day free trial version, go to:
https://www.mathworks.com/campaigns/products/trials.html and download / install Matlab. It is a
large file, so start it now, while we go through some of the fundamentals.
If asked, choose – at a minimum – the following toolboxes to be installed (you can install others as
well, if you wish):
Parallel Computing Toolbox
Deep Learning Toolbox
Statistics and Machine Learning Toolbox
Optimization Toolbox
Alternatively, you can buy the Student Version of Matlab for $99, which includes Matlab, Simulink,
and 10 toolboxes covering most of the above (plus $10 for the Deep Learning Toolbox)
Download and unzip the following folder. Place it somewhere (e.g., desktop), where you can
easily find it: https://users.rowan.edu/~polikar/ML_Workshopfiles.zip
[Diagram: machine learning at the intersection of Engineering & Computer Science, Statistics, Neuroscience, and Optimization]
Formal definitions
Machine learning is the “field of study that gives computers the ability to learn [from data] without being explicitly
programmed [for that task].” (A. Samuel, 1959)
A computer program is said to learn from experience E [data] with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E. (T. Mitchell, 1997)
Machine learning is the study of computer algorithms that improve automatically through experience and by the use
of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data,
known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.
(Wikipedia 2021)
Also related to Data Mining, which is involved with discovery of hidden and previously unknown
patterns and properties of the data.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Applications
Applications – a very brief / abbreviated list
Speech recognition / speaker identification / natural language processing
Handwritten character recognition
Financial data analysis / stock picking
Weather prediction / hurricane path prediction
Fingerprint identification
DNA sequence / phylogenetics / protein folding
Radar tracking, friend-foe identification
Biometrics / Iris scan identification
Topographical remote sensing
Text mining / web mining
Search engine algorithms
Energy pricing / demand prediction
Malware detection
Cyber / infrastructure security
Self driving vehicles
Recommender systems
Machine Learning
Subfields & Application Domains
https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications
Fundamentals of Machine Learning
Applications of ML in Bioinformatics
Auslander, N.; Gussow, A.B.; Koonin, E.V., Incorporating Machine Learning into Established Bioinformatics Frameworks, 2021
Common ML Algorithms
A: E. Alpaydin, Introduction to Machine Learning, MIT Press, 2004
B: C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Terminology
Clustering: Given data of objects obtained from an unknown number and nature of categories, grouping of
such data into clusters based on some measure of similarity.
Data mining: Given large volumes of data obtained from web pages, group the corresponding web pages into logically
meaningful sets (e.g., news articles, shopping sites, medical information, gaming sites, other commercial sites, etc.)
Given large number of sequences, group them into taxonomical classes
[Scatter plot: unlabeled sequences plotted by AT content vs. CG content; each dot represents one sequence]
Feature vector: x = [x1, x2, …, xd]^T, where xi is the value of feature i; x is a point in the d-dimensional feature space.
Example (d = 4):
x = [56, 80, 120, 220]^T, where the four features are the number of ACGA, CCGT, ATCG, and GGAT occurrences per million bp, respectively.
Terminology
Class: The category to which a given object belongs, typically denoted by ω
Pattern: A collection of features of an object under consideration, along with the correct class information of
that object. In classification, a pattern is a pair of variables (x, ω), where x is the feature vector and ω is the
corresponding label
Instance / Exemplar: Any given example pattern of an object
Decision boundary: A boundary in the d-dimensional feature space that separates patterns of different
classes from each other.
Training Data: Data used during training of a classifier, for which the correct labels are a priori known
Test / Validation Data: Data not used during training, but rather set aside to estimate the true (generalization) performance of a classifier; their correct labels are also a priori known
Field Test Data: The data the classifier is ultimately trained to classify; the correct class labels for these data are not known a priori.
[Scatter plot: exemplars / patterns / measurements from classes 1–4 in a 2-D feature space (Sensor 1 / Feature 1 vs. Sensor 2 / Feature 2)]
Terminology
Cost Function: A quantitative measure that represents the cost of making an error. The classifier is trained to minimize
this function.
Classifier: A parametric or nonparametric model which adjusts its parameters or weights to find the correct decision
boundaries through a learning algorithm using a training dataset – such that a cost function is minimized.
Model: A simplified mathematical / statistical construct that mimics (acts like) the underlying physical phenomenon that
generated the original data.
Parametric Model: A probabilistic / statistical model that assumes that the underlying phenomenon follows a specific
known probability distribution. The parameters of such a model are the parameters of the distribution.
A classifier based on determining the parameters of a distribution is also called a generative model as the underlying
distribution can be generated from the parameters.
Examples: Bayes classifier, expectation-maximization algorithm.
Nonparametric model: A model that does not assume a specific distribution, and that typically follows an optimization
algorithm to minimize error.
A classifier based on a nonparametric approach is also called a discriminative model, as the decision is then based
on a discriminant (or discriminant function).
Examples: Neural networks, decision trees, support vector machines.
Types of Classifiers
A selection
Supervised (classification):
Bayes classifier
Naïve Bayes classifier
K-nearest neighbor
Discriminant classifiers: linear / quadratic discriminant classifier
Neural networks: probabilistic neural network, multilayer perceptron, radial basis function network, Hopfield network, deep neural networks (convolutional neural nets, autoencoders, etc.)
Kernel methods: support vector machines
Decision trees: classification and regression tree (CART), random forest
Unsupervised (clustering):
K-means
Fuzzy C-means
Hierarchical clustering
Adaptive Resonance Theory (ART)
Self-organized maps (SOMs)
ISODATA
Taxonomic classification
Given a sequence, to which genus does it belong?
Let’s say we find the following features to be potentially useful:
• x1: # of genes per read
• x2: average gene length per read
Which one is the better feature?
Adapted from Duda, Hart & Stork, Pattern Classification, 2/e, Wiley, 2000
[Histograms of x1 and x2 for Genus 1 and Genus 2, and a scatter plot of the two genera in the (x1, x2) feature space]
Which of the boundaries would you choose?
Simple linear boundary – training error > 0
Nonlinear complex boundary – training error = 0
Simpler nonlinear boundary – training error > 0
Overfitting
Overfitting: Using too complex of a model (with many adjustable parameters) to explain a
relatively simple underlying phenomenon
Results in learning the noise in the data
A degree-M polynomial model:
y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M = Σ_{i=0}^{M} wi x^i
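The effect is easy to reproduce: fit polynomials of increasing degree M to a handful of noisy samples and watch the training error fall as the model begins to chase the noise. A minimal sketch in Python/NumPy rather than the workshop’s Matlab; the sine-wave data and the degrees tried are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying phenomenon: y = sin(2*pi*x)
x = rng.uniform(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

train_mse = {}
for M in (1, 3, 12):
    w = np.polyfit(x, y, M)                      # fits y(x, w) = sum_i w_i x^i
    train_mse[M] = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"M={M:2d}  training MSE = {train_mse[M]:.4f}")
```

The high-degree fit drives the training error toward zero – exactly the “learning the noise” behavior described above; its error on fresh data would typically be far worse.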
Occam’s Razor
William of Occam (1288-1348)
[Illustration: of the candidate decision boundaries in the (x1, x2) feature space, prefer the simplest one that explains the data]
A Complete
ML System
Data Acquisition
Let’s Ponder…
Let’s ponder on that question “Determine the right type of classifier” for a little bit…
Of the many, many classification algorithms, which is the best one?
If you were stranded on a desert island and could only bring one classification algorithm, what would it
be?
The denominator, summation over all values of B, is just a normalization constant, ensuring
that all probabilities add up to 1.
Bayesian Way
of Thinking
The classic war in statistics: Frequentists vs. Bayesians
They cannot even agree on the meaning of probability
• Frequentist: the expected likelihood of an event A, over a long run: P(A) = n/N.
• Bayesian: a measure of plausibility of an event happening, given an observation providing incomplete data, and a
previous (sometimes / possibly subjective) degree of belief (known as the prior, or a priori probability)
Many phenomena of random nature can be explained by the frequentist definition of probability:
• The probability of hitting the jackpot in lottery;
• The probability that there will be at least 10 non-rainy days in Philadelphia in October;
• The probability that at least one of you is born in July;
• The probability that the sum of two random cards will be 21;
…but some cannot!
• The probability that a major catastrophe will end life on Earth in the next 100 years ;
• The probability that the conflict in Syria will end by 2025;
• The probability that there will be another major recession in the next 10 years;
• The probability that fossil-based energy will be obsolete in 50 years.
• The probability that there will be another pandemic in the next 20 years
Yet, you can make approximate estimates of such probabilities
You are following a Bayesian way of thinking when you do so
Bayesian Way
of Thinking
In Bayesian statistics, we compute the probability based on three pieces of information:
Prior: Our (subjective?) degree of belief that the event is plausible in the first place.
Likelihood: The probability of making an observation, under the condition that the event has occurred: how likely
is it to observe what I just observed, if event A did in fact happen (or, how likely is it to observe this outcome, if A [class
ωA] were true). Likelihood describes what kind of data we expect to see in each class.
Evidence: The probability of making such an observation.
It is the combination of these three that gives the probability of an event, given that an observation (however incomplete
information it may provide) has been made. The probability computed based on such an observation is then called the
posterior probability.
Given the observation, Bayesian thinking updates the original belief (the prior) based on the likelihood and evidence.
posterior = (likelihood × prior) / evidence
Sometimes, the combination of evidence and likelihood is so compelling that it can overwrite our original belief.
• Recall the fossil-based energy example: if asked in the early 1900s, such a phenomenon was not on anyone’s radar – prior: very low
(near zero). In 2022, we now have alternative energy sources (solar, wind, etc.) – likelihood, P(countless solar energy technologies | no more fossil fuels):
very high. This high likelihood trumps our low prior – posterior, P(no more fossil fuels | countless solar energy technologies): very high!
Likelihood: The (class-conditional) probability of observing a feature value of x, given that the correct class is ωj – what kind of data we expect to see in class ωj. All things being equal, the category with the higher class-conditional probability is more “likely” to be the correct class.
Prior Probability: The total probability of the correct class being ωj, determined based on prior experience (before an observation is made).
Posterior Probability: The (conditional) probability of the correct class being ωj, given that feature value x has been observed. Based on the measurement (observation), the probability of the correct class being ωj has shifted from P(ωj) to P(ωj|x).
Evidence: The total probability of observing the feature value as x. Serves as a normalizing constant, ensuring that posterior probabilities add up to 1.
A Bayes classifier decides on the class ωj that has the largest posterior probability.
The Bayes classifier is statistically the best classifier one can possibly construct.
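With the terms defined above, the Bayes decision is a one-liner. A sketch in Python with made-up priors and likelihood values for two genera (the numbers are purely illustrative):

```python
# Posterior for each class via Bayes rule, for a single observed feature value x.
priors = {"genus1": 0.7, "genus2": 0.3}          # P(w_j), before observing x
likelihoods = {"genus1": 0.10, "genus2": 0.45}   # p(x | w_j) at the observed x

# Evidence p(x): normalizing constant so the posteriors add up to 1
evidence = sum(likelihoods[c] * priors[c] for c in priors)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

# Bayes decision: pick the class with the largest posterior
decision = max(posteriors, key=posteriors.get)
print(posteriors, "->", decision)
```

Note how the high likelihood for genus2 overrides its lower prior – the same “belief updating” described on the previous slide.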
How do we compute
Class Conditional Probabilities?
ω1: Genus 1; P(x|ω1): class conditional probability for Genus 1
ω2: Genus 2; P(x|ω2): class conditional probability for Genus 2
Likelihood: For example, given that Genus 2 (ω2) is observed, what is the probability that we had 12 kmer1 occurrences in a given read?
Or simply, what is the probability that we will identify Genus 2 when there are 12 kmer1 occurrences?
Or, how likely is it that a Genus 2 organism is associated with 12 kmer1 occurrences in any given read?
P(ωj|x) = p(x|ωj) · P(ωj) / p(x)
        = p(x|ωj) · P(ωj) / Σ_{k=1}^{C} p(x|ωk) · P(ωk)
Choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
If – and only if – the random variables in a vector are statistically independent:
p(x) = p(x1) · p(x2) ⋯ p(xd) = Π_{i=1}^{d} p(xi)
While the notation changes only slightly, the implications are quite substantial:
The curse of dimensionality
With 20 bins per feature and 30 observations per bin, two features require 20×20×30 = 12,000 observations; three features require 20×20×20×30 = 240,000 observations!
Well…That Sucks!
Yup! While the Bayes classifier is the best classifier, in most practical applications, you cannot use it.
Why? Because it requires a lot of data from each feature – something you may not have
That is why there are scores of other classifiers out there for you to choose – each is generally good for certain
applications and not so optimal for others.
So, we wasted a perfectly good hour on something we cannot use???
Not quite.
We will come back to Bayes Classifier in the second half.
X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, et al., “Top 10 Algorithms in Data Mining,” Knowledge Information Systems, vol. 14, pp. 1-37, 2008.
* C4.5 is listed as one of the top 10 in the Wu et al. paper. Dr. Polikar disagrees, as C4.5 is a variant of CART; the MLP is a far more
deserving classifier for the top 10. Also note that J. Quinlan, the creator of C4.5, is one of the authors of that paper.
K-Nearest Neighbor
Given a set of labeled training points, a test instance should be given
the label that appears most abundantly in its surroundings.
[Scatter plot: the k = 11 nearest neighbors of a test point among measurements from classes 1–4, plotted as Sensor 1 (feature 1) vs. Sensor 2 (feature 2)]
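The rule can be sketched in a few lines of Python/NumPy; the two-cluster toy data, the helper name knn_predict, and the choice k = 3 are illustrative, not part of the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=11):
    """Label x with the class most abundant among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(d)[:k]               # indices of the k closest points
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Tiny illustration: two clusters in a 2-D feature space
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.1, 0.1],
              [2.0, 2.0], [2.1, 1.9], [1.8, 2.2], [2.2, 2.1]])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])
print(knn_predict(X, y, np.array([0.15, 0.2]), k=3))
```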
If the likelihoods follow a different, non-Gaussian (but of known type, say chi-square) distribution,
compute the parameters of that distribution for (1) and use the definition of that distribution for (2).
Naïve Bayes Classifier
(Unknown Distribution type)
If the form of the distribution is not known, then it must be estimated. Use either a density estimation technique (such as
Parzen windows, k-nearest neighbor, etc.), or simply follow the histogram approach shown earlier (slide 46):
For each class, look at the minimum and maximum values of each feature
Divide that range into a meaningful number of bins, based on the
number of training data (typically, a minimum of 10)
• Optimize the number of bins vs. the number of instances in each bin
• The larger the number of bins, the finer (higher-resolution) the estimate
• The larger the number of instances in each bin, the more accurate the estimate
• For a given dataset size, you can only maximize one or the other.
Create the histogram by counting the number of instances that fall into each bin
The distribution is the estimate of the class conditional likelihood of that feature.
Follow steps (3) and (4) of the pseudo code in the previous slide.
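The histogram recipe above can be sketched as follows (Python/NumPy; the bin count, the Gaussian toy data, and the out-of-range handling are illustrative assumptions):

```python
import numpy as np

def histogram_likelihood(feature_values, n_bins=10):
    """Estimate the class-conditional likelihood p(x | class) of one feature
    from the training values observed for that class."""
    counts, edges = np.histogram(feature_values, bins=n_bins)
    density = counts / counts.sum()   # relative frequency in each bin

    def p(x):
        # outside the observed range, the histogram gives no support
        if x < edges[0] or x > edges[-1]:
            return 0.0
        i = min(np.searchsorted(edges, x, side="right") - 1, n_bins - 1)
        return float(density[i])
    return p

rng = np.random.default_rng(1)
vals = rng.normal(5.0, 1.0, 500)      # feature values observed for one class
p = histogram_likelihood(vals)
print(p(5.0), p(100.0))
```

In a naïve Bayes classifier one such estimator is built per feature and per class, and the per-feature likelihoods are multiplied under the independence assumption.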
Continue at home…
1. Unlike just about every other classification algorithm we have seen so far, decision trees are not based on a
similarity or distance measure
2. Decision trees can handle nominal (categorical), as well as ordinal (ordered) and cardinal (numeric) data,
while most other classifiers require such data to be transformed into numeric data.
3. The decisions made by a decision tree are intuitive, as they represent the information in a hierarchical structure.
Hence, it is easy to determine the reason for a specific decision by tracing that decision
to specific values of the features. Unlike most other classifiers, decision trees are NOT black boxes.
Decision trees are members of the more general class of models known as Graphical Models, as
they are described using graph theory and graph terminology.
load fisheriris
% fitctree replaces the older ClassificationTree.fit
ctree = fitctree(meas, species, 'ClassNames', {'setosa','versicolor','virginica'}, 'PredictorNames', {'SL','SW','PL','PW'});
view(ctree)
view(ctree,'mode','graph')
[Diagram: a single neuron: inputs x1, …, xd with weights w = [w1, …, wd] and bias w0 feed a summation Σ followed by an activation function f]
net = Σ_{i=0}^{d} wi xi = x wᵀ + w0 (with x0 = 1)
y = f(net), f: activation function
[Diagram: multilayer perceptron with d input nodes (x1, …, xd), H hidden layer nodes, and c output nodes (z1, …, zc); input-to-hidden weights Wij and hidden-to-output weights Wjk, with i = 1, 2, …, d; j = 1, 2, …, H; k = 1, 2, …, c]
Stage 1: Network Training – present examples, indicate desired outputs, determine synaptic weights (“knowledge”)
Stage 2: Network Testing
[Diagram: the same d-H-c MLP, annotated with netj, yj at the hidden layer and netk, zk at the output layer]
Hidden layer outputs: yj = f(netj) = f(Σ_{i=1}^{d} wji xi), j = 1, 2, …, H
Output layer outputs: zk = f(netk) = f(Σ_{j=1}^{H} wkj yj), k = 1, 2, …, c
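Putting the two layer equations together, a forward pass through a d-H-c network is just two matrix-vector products with an activation in between. A Python/NumPy sketch with random weights (the sigmoid choice and the sizes d = 4, H = 3, c = 2 are arbitrary):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlp_forward(x, W_ji, W_kj):
    """Forward pass matching the slide's notation:
    y_j = f(sum_i w_ji x_i), z_k = f(sum_j w_kj y_j)."""
    y = sigmoid(W_ji @ x)   # hidden layer outputs, shape (H,)
    z = sigmoid(W_kj @ y)   # output layer outputs, shape (c,)
    return y, z

rng = np.random.default_rng(0)
d, H, c = 4, 3, 2
x = rng.normal(size=d)
W_ji = rng.normal(size=(H, d))   # input-to-hidden weights
W_kj = rng.normal(size=(c, H))   # hidden-to-output weights
y, z = mlp_forward(x, W_ji, W_kj)
print(z)
```

Training (backpropagation) adjusts W_ji and W_kj by gradient descent on a cost function, which is the subject of the next slides.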
Activation Functions
[Plots of four activation functions f(net) vs. net: threshold (logic unit), linear / identity, sigmoid (shown for slopes β = 0.9, 0.5, 0.25, 0.1), and ReLU]
w* = argmin_w J(w)
∇J(w) = [∂J/∂w1, ∂J/∂w2, ⋯, ∂J/∂w_{d+1}]
[Plot: cost J(w) vs. w, with negative gradients −∇J(w1), −∇J(w2), −∇J(w3) pointing downhill from points w1, w2, w3; learning rates η1, η2]
Basic Gradient Descent
initialize w, threshold θ, η(k), k = 0
do k ← k + 1
   w ← w − η(k) · ∇J(w)
until |η(k) · ∇J(w)| < θ
return w
end
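The pseudocode above translates almost line for line into Python; here it is applied to a toy quadratic cost whose minimum is known (the cost function, step size, and stopping threshold are illustrative choices):

```python
import numpy as np

def gradient_descent(grad_J, w, eta=0.1, theta=1e-6, max_iter=10_000):
    """Basic gradient descent from the slide: w <- w - eta * grad J(w),
    stopping when |eta * grad J(w)| < theta."""
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# J(w) = (w1 - 3)^2 + (w2 + 1)^2 has its minimum at w* = (3, -1)
grad_J = lambda w: 2.0 * (w - np.array([3.0, -1.0]))
w_star = gradient_descent(grad_J, np.zeros(2))
print(w_star)
```

With a fixed eta, each iteration shrinks the distance to the minimum by a constant factor here; a cost with narrow curved valleys is where the fancier variants listed above earn their keep.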
Some Issues to Consider…
There are many forms of gradient descent that address these issues: Newton’s descent,
the momentum term, the conjugate gradient algorithm, stochastic gradient descent, etc.
Look Ma…
Not A Single Line of Code
[Plots: MLP-based classification of the test data on three example 2-D datasets]
[Diagram: two classes in a 2-D feature space (Feature 1 vs. Feature 2), separated by the maximum-margin hyperplane]
Maximizing Margins
The distance r0 of the hyperplane from the origin: since g(x)|_{x=0} = wᵀx + w0 = w0,
r0 = w0 / ‖w‖
! w0 determines the location of the hyperplane; w determines the orientation of the hyperplane
Formalizing the Problem
Recall that the separating hyperplane (decision boundary) is given by
(using b instead of w0, as commonly done in the SVM literature)
g(x) = wᵀx + w0 = wᵀx + b
Given two-class training data xi with labels yi = +1 and yi = −1, then:
wᵀx + b ≥ +1 ⇒ x ∈ ω1
wᵀx + b = 0 ⇒ x on the boundary
wᵀx + b ≤ −1 ⇒ x ∈ ω2
yi (wᵀxi + b) ≥ 1, ∀i (correct classification of all points)
[Diagram: classes ω1 and ω2 in the (Feature 1, Feature 2) plane, with the hyperplane wᵀx + b = 0 and the margins wᵀx + b = ±1]
The distance r of a point x on the margin (where g(x) = 1) to the hyperplane g(x) = 0:
r = g(x)/‖w‖ = (wᵀx + b)/‖w‖ = 1/‖w‖
Then the length of the margin is m = 2/‖w‖
Constrained
Optimization
The best hyperplane – the one that provides the maximum separation – is
therefore the one that maximizes m = 2/‖w‖
However, by arbitrarily scaling w we can make its length as small as we want; in fact, w = 0
would provide an infinite margin – this is clearly not an interesting – or even viable –
solution. There has to be a constraint on this problem.
The constraint comes from the correct classification of all data points, which requires that
yi (wᵀxi + b) ≥ 1, ∀i
Therefore, the problem of finding the optimal decision boundary is converted into
the following constrained optimization problem:
min (1/2)‖w‖²
subject to yi (wᵀxi + b) ≥ 1, ∀i
Note that maximizing m = 2/‖w‖ is equivalent to minimizing ‖w‖²/2
We take the square of the length of the vector, which (along with the ½ factor) does not
change the solution, but makes the process of solution easier.
Among other things, since the function to be minimized is quadratic, it has only a single
(global) minimum.
A Primer on
Constrained Optimization
Recall from our previous discussion that constrained optimization can be
solved through Lagrange multipliers:
If we wish to find the extremum of a function f(x) subject to some constraint
g(x) = 0, the extremum point x can be found as follows:
1. Form the Lagrange function to convert the problem to an unconstrained problem, where α –
whose value needs to be determined – is the Lagrange multiplier:
L(x, α) = f(x) + α g(x)
2. Solve the resulting unconstrained problem by taking the derivative:
∂L(x, α)/∂x = ∂(f(x) + α g(x))/∂x = ∂f(x)/∂x + α ∂g(x)/∂x
3. For a point x* to be a solution it must then satisfy:
∂L(x, α)/∂x |_{x=x*} = 0,  g(x*) = 0
With n constraints gi(x) = 0, i = 1, …, n, the same conditions become:
∂L(x, αi)/∂x |_{x=x*} = ∂/∂x [ f(x) + Σ_{i=1}^{n} αi gi(x) ] |_{x=x*} = 0
gi(x*) = 0, i = 1, …, n
For the SVM problem, the primal constraint yi (wᵀxi + b) ≥ 1, ∀i becomes, in the dual, αi ≥ 0 and Σ_{i=1}^{n} αi yi = 0.
[Diagram: (a) data in the original feature space; (b) the same data mapped by φ(·) into a higher-dimensional space where they become linearly separable]
Example (XOR): x = [x1, x2], with radial basis functions
φ1(x) = e^{−‖x − t1‖²}, t1 = [1, 1]ᵀ and φ2(x) = e^{−‖x − t2‖²}, t2 = [0, 0]ᵀ
[Diagrams: (c) the four XOR points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane; (d) their images in the (φ1(x), φ2(x)) plane, where the two classes become linearly separable]
Another Example
maximize Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj)
subject to 0 ≤ αi ≤ C, Σ_{i=1}^{n} αi yi = 0
Because the kernel lets us work in the high-dimensional space efficiently, we do not need to compute the
mapping – or any high-dimensional computation, for that matter.
Many geometric operators such as angles and distances can be expressed in
terms of inner products. Then the trick is simply to find a kernel function K
such that
K(xi, xj) = φ(xi)ᵀ φ(xj)
Polynomial kernel with degree d: K(xi, xj) = (1 + γ xiᵀxj)^d
The user defines the value of d (and γ), which then controls how large the feature
space dimensionality will be. As seen earlier, a choice of d=2 moves 2-D x into 6-D
z. Similarly using d=3 moves a 2-D x into a 10-D z.
Radial basis (Gaussian) kernel with width σ: K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) )
The user defines the kernel width σ. This SVM is closely related to the RBF
network. It increases the dimensionality to ∞, as every data point is replaced by a
continuous Gaussian. The number of RBFs and their centers are determined based
on the (number of) support vectors and their values, respectively.
Hyperbolic tangent (sigmoid – two-layer MLP) kernel: K(xi, xj) = tanh(κ xiᵀxj + θ)
The parameters κ and θ are user defined. The number of hidden layer nodes and
their values are determined based on the (number) of support vectors and their
values, respectively. Then, the hidden – output weights are the Lagrange multipliers
αi. This kernel satisfies Mercer’s conditions only for certain values of κ and θ.
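The kernel trick is easy to verify numerically: the degree-2 polynomial kernel value equals an inner product in a 6-D mapped space, without the SVM ever forming that mapping. A Python sketch (the explicit phi shown is the standard degree-2 expansion for γ = 1; the test vectors are arbitrary):

```python
import numpy as np

def poly_kernel(xi, xj, d=2, gamma=1.0):
    """Polynomial kernel: K(xi, xj) = (1 + gamma * xi . xj)^d."""
    return (1.0 + gamma * np.dot(xi, xj)) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: K = exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

# For d=2, gamma=1 in 2-D, the implicit mapping is the 6-D vector
# phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, sqrt(2)x1x2, x2^2)
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(a, b), phi(a) @ phi(b))   # the two values agree
```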
Expectation – Maximization
Gaussian Mixture Models
An extremely versatile optimization algorithm,
EM is an iterative approach that cycles the expectation (E) and
maximization (M) steps to find the estimates of a statistical model.
Designed for parameter estimation (determining the values of unknown
parameters θ of a model), and commonly used in conjunction with other
algorithms, such as k-means,
Gaussian Mixture Models (GMMs), hierarchical mixtures of experts, or in
missing-data analysis.
In the E-step, the expected value of a likelihood function – the figure of merit in
determining the true value of the unknown parameters – is computed, under the
current estimate θ̂ of the unknown parameters θ (that are to be estimated).
In the M-step, the new estimate θ̂ is computed such that this new estimate
maximizes the current likelihood. Then, the E & M steps are iteratively continued
until convergence.
In GMMs, data are modeled using a weighted combination of Gaussians, and
EM is used to determine the Gaussian parameters, as well as the mixing
coefficients (the mixing weights)
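A minimal EM loop for a two-component 1-D GMM can be sketched as follows (Python/NumPy; the initialization strategy and the toy data are illustrative assumptions, and a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (a minimal sketch)."""
    mu = np.array([x.min(), x.max()])        # crude initial mean guesses
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])                # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to maximize the expected likelihood
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / x.size
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
mu, var, pi = em_gmm_1d(x)
print(np.sort(mu))   # recovers means near -4 and 4
```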
AdaBoost
Combine the decisions of an ensemble of classifiers to
reduce the likelihood of having chosen a poorly trained classifier.
Conceptually similar to seeking several opinions before making an important
decision.
Based on the premise that there is increased confidence that a decision agreed
upon by many (experts, reviewers, doctors, “classifiers”) is usually correct.
AdaBoost generates an ensemble of classifiers using a given “base model,”
which can be any supervised classifier. The accuracy of the ensemble, based
on weighted majority voting of its member classifiers, is usually higher than
that of a single classifier of that type.
The weaker the base classifier (the poorer its performance), the greater the
impact of AdaBoost.
AdaBoost trains the ensemble members on different subsets of the training data.
Each additional classifier is trained with data that is biased towards those
instances that were misclassified by the previous classifier – a focus on
increasingly difficult-to-learn samples.
AdaBoost turns a dumb classifier into a smart one!
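The reweighting loop can be sketched with threshold “stumps” as the base model (Python/NumPy; the stump learner, the 1-D toy labeling, and the round count are illustrative – no single stump can fit this labeling, but a few boosted stumps can):

```python
import numpy as np

def train_stump(x, y, w):
    """Best threshold stump (predicts +1/-1) under sample weights w."""
    best = None
    for t in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x >= t, sign, -sign)
            err = w[pred != y].sum()           # weighted error
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(x, y, n_rounds=10):
    """AdaBoost: reweight the data toward previously misclassified instances."""
    w = np.full(x.size, 1.0 / x.size)
    ensemble = []
    for _ in range(n_rounds):
        err, t, sign = train_stump(x, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # this classifier's voting weight
        pred = np.where(x >= t, sign, -sign)
        w *= np.exp(-alpha * y * pred)         # boost the weight of mistakes
        w /= w.sum()
        ensemble.append((alpha, t, sign))
    return ensemble

def predict(ensemble, x):
    votes = sum(a * np.where(x >= t, s, -s) for a, t, s in ensemble)
    return np.where(votes >= 0, 1, -1)

# A labeling no single stump can fit, but three boosted stumps can
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1, 1])
ens = adaboost(x, y, n_rounds=3)
print(predict(ens, x))
```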
There are several difficulties in measuring the true performance of any given classifier, let alone
determining which one is best among many others.
In fact, there is even a theorem, called the No Free Lunch Theorem*, that states that – everything else being
equal – no algorithm is universally better than the others on all possible datasets.
Note that this is just an estimate, since we do not know the performance on the field data.
Assume that the errors have a binomial distribution with parameters (PD, Ntest), i.e., the classifier commits an
error with probability PD = Nerror / Ntest. If Ntest > 30, PD · Ntest > 5, and (1 − PD) · Ntest > 5, the
binomial distribution can be approximated by a Gaussian distribution.
Critical Value zα/2 3.00 2.58 2.33 2.05 2.00 1.96 1.645 1.28 1.00 0.675
PD ± zα/2 · σPD ≈ PD ± zα/2 · √( PD (1 − PD) / Ntest )
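The interval above is straightforward to compute. A Python sketch (the error counts are made up; z = 1.96 corresponds to ~95% confidence):

```python
import math

def error_ci(n_errors, n_test, z=1.96):
    """Gaussian-approximation confidence interval for the error rate P_D
    estimated from n_errors mistakes on n_test test instances."""
    p = n_errors / n_test
    half = z * math.sqrt(p * (1 - p) / n_test)
    return p - half, p + half

lo, hi = error_ci(15, 100)   # e.g., 15 errors on 100 test instances
print(f"error rate in [{lo:.3f}, {hi:.3f}] with ~95% confidence")
```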
But the question remains: What if the test data we picked consisted of unusually easy or unusually
difficult instances? Wouldn’t the performance be different if we had a different test set?
There are two better ways to estimate the true performance:
Shuffle training and testing datasets several times, making sure that they do not overlap. Calculate a
generalization performance for each, and take the average of the performances
K-fold cross validation: This is considered as one of the better estimators:
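One way the K-fold procedure can be sketched (Python/NumPy; the nearest-class-mean classifier and the toy data are illustrative stand-ins for whatever classifier is being evaluated):

```python
import numpy as np

def kfold_accuracy(X, y, train_and_test, k=5, seed=0):
    """Split the data into k non-overlapping folds; each fold serves once as
    the test set while the remaining folds train the classifier."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append(train_and_test(X[train], y[train], X[test], y[test]))
    return float(np.mean(accs))

def nearest_mean(Xtr, ytr, Xte, yte):
    """Minimal classifier for the demo: assign each point to the nearest class mean."""
    means = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    preds = [min(means, key=lambda c: np.linalg.norm(p - means[c])) for p in Xte]
    return float(np.mean(np.array(preds) == yte))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
acc = kfold_accuracy(X, y, nearest_mean, k=5)
print(acc)
```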
Most classification algorithms can be configured to provide a continuous output that can be interpreted as level
of support given by the classifier for each class.
Sometimes, these supports can also be interpreted as the posterior probability of class ωc, given the observed data: p(ωc|x).
For a two-class problem, we have p(ω+|x) = 1 − p(ω−|x).
So, how do we make the actual classification decision?
We can certainly choose ω+ if p(ω+|x) > p(ω−|x), and ω− otherwise. This is equivalent to setting a decision threshold of
θ = 0.5, since the decision will be ω+ if p(ω+|x) > θ, and ω− otherwise.
Now, imagine we vary that threshold between 0 and 1.
If we set θ = 0, everything will be classified as ω+. Since no instance will be labeled ω−, we will have zero false negatives,
but also no true negatives. This results in 100% recall (sensitivity) – as all of the actual positives are indeed identified as
positive – but 0% specificity.
Now, if we set θ = 1, then everything will be classified as ω−. There will be no false positives, but also no true positives. This
results in 100% specificity (since all of the actual negatives are indeed classified as negative), but 0% sensitivity.
So, if we vary θ between 0 and 1, we can control whether we want our classifier to be more cautious against false
positives (at the risk of reduced true positives) or more cautious against false negatives (at the risk of reduced true negatives).
Sweeping θ between 0 and 1, and plotting the resulting true positive rate (i.e., sensitivity or recall)
against the false positive rate (i.e., 1 − specificity), we get a curve called the receiver operating characteristic (ROC)
curve.
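The threshold sweep is mechanical. A Python/NumPy sketch (the score distributions simulating an imperfect classifier are made up):

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep the decision threshold over [0, 1] and record (FPR, TPR) pairs.
    scores: estimated P(w+ | x); labels: 1 for the positive class, 0 otherwise."""
    pts = []
    for theta in np.linspace(0, 1, 101):
        pred = scores >= theta
        tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()  # sensitivity
        fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()  # 1 - specificity
        pts.append((fpr, tpr))
    return pts

rng = np.random.default_rng(0)
labels = np.array([1] * 100 + [0] * 100)
# Imperfect classifier: the positive class tends to get higher scores
scores = np.clip(np.concatenate([rng.normal(0.7, 0.15, 100),
                                 rng.normal(0.3, 0.15, 100)]), 0, 1)
pts = roc_curve(scores, labels)
print(pts[0], pts[-1])
```

At theta = 0 the curve sits at (FPR, TPR) = (1, 1); as theta rises toward 1, both rates fall toward (0, 0).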
The upper left corner represents perfect classification: 100% TPR and 0% FPR.
Classifiers are usually not perfect, of course, so the closer a classifier’s ROC curve gets to the upper left corner, the better it performs.
Furthermore, because the test does not assume a particular underlying distribution (Gaussian, t-
distribution, 𝜒𝜒 2 , etc.), it is a non-parametric test, which is usually more robust to outliers.
Matlab code for those who like it:

function [d,labd]=gendatcb(N,a,alpha)
% N data points, uniform distribution,
% checkerboard with side a,
% rotated at alpha
d=rand(N,2);
d_transformed=[d(:,1)*cos(alpha)-d(:,2)*sin(alpha), ...
               d(:,1)*sin(alpha)+d(:,2)*cos(alpha)];
s=ceil(d_transformed(:,1)/a)+...
  floor(d_transformed(:,2)/a);
labd=2-mod(s,2);

[Plot: the resulting rotated checkerboard data]