Week2 - 2022 - Biological Data Science - Polikar - Traditional Machine Learning Lecture

Signal Processing &
Pattern Recognition Laboratory
@ Rowan University
presents

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


ROBI POLIKAR
SIGNAL PROCESSING & PATTERN RECOGNITION LABORATORY
ROWAN UNIVERSITY, GLASSBORO, NJ, USA
SPPRL logo Copyright © Robi Polikar, 2001
Fundamentals of Machine Learning © Robi Polikar, 2022, Rowan University, Glassboro, NJ
© 2022, All Rights Reserved, Robi Polikar.

These lecture notes were prepared by Robi Polikar. Unauthorized use, including duplication, even in part, is not allowed
without explicit written permission. Such permission will be given – upon request – for noncommercial educational
purposes if you agree to all of the following:

1. Restrict the usage of this material for noncommercial and nonprofit educational purposes only; AND

2. The entire presentation is kept together as a whole, including this page and this entire notice; AND

3. You include the following link/reference on your site: © Robi Polikar http://users.rowan.edu/~polikar.
Getting Ready
 If you do not already have Matlab, and have not already obtained a 30-day free trial version, go to:
https://www.mathworks.com/campaigns/products/trials.html and download / install Matlab. It is a
large file, so start it now, while we go through some of the basic fundamentals.
 If asked, choose – at a minimum – the following toolboxes to be installed (you can install others as
well, if you wish):
 Parallel Computing Toolbox
 Deep Learning Toolbox
 Statistics and Machine Learning Toolbox
 Optimization Toolbox
 Alternatively, you can buy the Student Version of Matlab for $99, which includes Matlab, Simulink
and 10 toolboxes, including most of the above (plus $10 for the Deep Learning Toolbox)
 Download and unzip the following folder. Place it somewhere (e.g., desktop), where you can
easily find it: https://users.rowan.edu/~polikar/ML_Workshopfiles.zip

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Rowan University

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Pattern Recognition
 Who are they…?

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Traffic
Signs

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Machine Learning &
Pattern Recognition

Machine learning draws on engineering & computer science, statistics, neuroscience and optimization, and overlaps with several related subfields:

 Artificial Intelligence / Learning Theory
  • PAC learning • machine learning of concepts with provable guarantees of performance • intelligent agents
 Statistical Learning Theory / Statistical Pattern Recognition
  • Bayesian learning • density estimation • Bayes classifiers • kernel methods • resampling techniques
 Computational Intelligence / Pattern Recognition
  • Feature extraction • supervised and unsupervised learning • prediction / regression • system identification • neural networks • ensemble learning
 Computational Neuroscience / Bioinformatics
  • Mathematical models of the neuron, memory and learning • bioinformatics, genetic data analysis / metagenomics / protein folding, etc.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Machine Learning
 Informal definition: Automated analysis of – typically large volumes of – data in search of hidden
structures / patterns / information based on training with prior representative data.
 Pattern recognition: Classification of objects into (predefined) categories or classes
• Given data, assign labels (categories) that identify the correct class
• Identify the input / output relationship (mapping) of an unknown system (system identification)
• At the most basic level, given data, answer the question “what is this?”

 Formal definitions
 Machine learning is the “field of study that gives computers the ability to learn [from data] without being explicitly
programmed [for that task].” (A. Samuel, 1959)
 A computer program is said to learn from experience E [data] with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E. (T. Mitchell, 1997)
 Machine learning is the study of computer algorithms that improve automatically through experience and by the use
of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data,
known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.
(Wikipedia 2021)
 Also related to Data Mining, which is concerned with the discovery of hidden and previously unknown
patterns and properties of the data.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Applications
 Applications – a very brief / abbreviated list
 Speech recognition / speaker identification / natural language processing
 Handwritten character recognition
 Financial data analysis / stock picking
 Weather prediction / hurricane path prediction
 Fingerprint identification
 DNA sequence / phylogenetics / protein folding
 Radar tracking, friend-foe identification
 Biometrics / Iris scan identification
 Topographical remote sensing
 Text mining / web mining
 Search engine algorithms
 Energy pricing / demand prediction
 Malware detection
 Cyber / infrastructure security
 Self-driving vehicles
 Recommender systems
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Machine Learning
Subfields & Application Domains

https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Applications of ML in Bioinformatics

Auslander, N.; Gussow, A.B.; Koonin, E.V. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int. J. Mol. Sci. 2021, 22, 2903. https://doi.org/10.3390/ijms22062903

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Common ML Algorithms

Auslander, N.; Gussow, A.B.; Koonin, E.V. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int. J. Mol. Sci. 2021, 22, 2903. https://doi.org/10.3390/ijms22062903

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Terminology
 Classification: Identification (assigning a label) of a particular object to its correct category, based on the
(features of) data collected from that object.
 Classify handwritten objects into one of a previously fixed set of alphanumeric characters B
 Loan / credit applications A
  • Two classes of applicants: low-risk / high-risk
  • Classification can be in the form of a rule, a discriminant, or a decision boundary:

    IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

A E. Alpaydin, Introduction to Machine Learning, MIT Press, 2004
B C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Terminology
 Clustering: Given data of objects obtained from an unknown number and nature of categories, grouping of
such data into clusters based on some measure of similarity.

 Data mining: Given large volumes of data obtained from web pages, group the corresponding web pages into logically
meaningful sets (e.g., news articles, shopping sites, medical information, gaming sites, other commercial sites, etc.)
 Given a large number of sequences, group them into taxonomical classes

(Figure: unlabeled data points plotted by AT content vs. CG content; each dot represents one sequence.) RP

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Terminology
 Supervised learning: Given training data with previously labeled classes, learn the mapping between the
data and their correct classes.
 Associated with “classification,” typically involves adaptively changing the parameters of a model (classifier) until the
model output fits the data
 Unsupervised learning: Given unlabeled data obtained from unknown number of categories, learn how to
group such data into meaningful clusters based on some measure of similarity. Also known as knowledge
discovery
 Typically associated with “clustering” and “density estimation”
 Semi-supervised learning: Similar to supervised learning, but with very little labeled data and lots of
unlabeled data.
 Reinforcement learning: Given a sequence of outputs, learn a policy to obtain the desired output,
maximizing a long-term reward. Typically associated with credit assignment and game-playing problems
 Learn how to play chess (or any other game) – given only the rules of the game (how different pieces can move) and the
final outcome of the game (you won or you lost), learn the objective and strategies of playing chess.
 No single good move – the game is won if the sequence of moves is collectively good!
• Another example: autonomous driving / navigation

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Terminology
 Prediction: Given historical data obtained from the previous behavior of a system / object, predict the future
behavior of the same object. (Figure courtesy of Rasmussen, Gaussian Processes)

 Regression: Given data obtained from an object / system at discrete time points, predict (estimate) the behavior
at other (unobserved) time points. Regression is essentially curve fitting, and is used to determine an unknown
function y = f(x) from its known samples {xi, yi}. Hence, regression solves system identification problems.
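A minimal MATLAB sketch of regression as curve fitting (base MATLAB only; the data below are synthetic and purely illustrative, not from the lecture):

xi = 0:0.5:5;                        % known sample locations
yi = sin(xi) + 0.1*randn(size(xi));  % noisy observations of an unknown y = f(x)
p  = polyfit(xi, yi, 3);             % fit a cubic polynomial model to the samples
xq = 0:0.05:5;                       % unobserved points
yq = polyval(p, xq);                 % predicted (estimated) behavior at those points
plot(xi, yi, 'o', xq, yq, '-')       % compare the samples with the fitted curve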

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Terminology
 Feature: a variable (predictor) believed to carry discriminating and characterizing information about the
objects under consideration
 Feature vector: A collection of d features, ordered in some meaningful way into a d-dimensional column
vector, that represents the signature of the object to be identified.
 Feature space: The d-dimensional space in which the feature vectors lie. A d-dimensional vector in a d-
dimensional space constitutes a point in that space.

 x1  feature 1 x2
x 
feature 2
x =  2 
 x
 
 xd  feature d
x1
 56  # of ACGA per million bp
 80  # of CCGT per million bp
x= 
120  # of ATCG per million bp
  x3 Feature space (3D)
 220  # of GGAT per million bp
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Terminology
 Class: The category to which a given object belongs, typically denoted by ω
 Pattern: A collection of features of an object under consideration, along with the correct class information of
that object. In classification, a pattern is a pair of variables, 𝒙𝒙, 𝜔𝜔 where 𝒙𝒙 is the feature vector and 𝜔𝜔 is the
corresponding label
 Instance/ Exemplar: Any given example pattern of an object
 Decision boundary: A boundary in the d-dimensional feature space that separates patterns of different
classes from each other.
Sensor 2 / Feature2
 Training Data: Data used during training of a classifier for
which the correct labels are a priori known
Test / Validation Data: Data not used during training, but rather
set aside to estimate the true (generalization) performance of a
classifier, for which correct labels are also a priori known
Field Test Data: Unknown data to be classified for which the
classifier is ultimately trained. The correct class labels for these
data are not known a priori.
Sensor 1 / Feature 1
Exemplars / patterns / measurements from
class 1 class2 class 3 class 4
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Terminology
 Cost Function: A quantitative measure that represents the cost of making an error. The classifier is trained to minimize
this function.
 Classifier: A parametric or nonparametric model which adjusts its parameters or weights to find the correct decision
boundaries through a learning algorithm using a training dataset – such that a cost function is minimized.
 Model: A simplified mathematical / statistical construct that mimics (acts like) the underlying physical phenomenon that
generated the original data.
 Parametric Model: A probabilistic / statistical model that assumes that the underlying phenomenon follows a specific
known probability distribution. The parameters of such a model are the parameters of the distribution.
 A classifier based on determining the parameters of a distribution is also called a generative model as the underlying
distribution can be generated from the parameters.
 Examples: Bayes classifier, expectation-maximization algorithm.
 Nonparametric model: A model that does not assume a specific distribution, and that typically follows an optimization
algorithm to minimize error.
 A classifier based on using a nonparametric approach is also called a discriminative model , as the decision is then based
on a discriminant (or discriminant function).
 Examples: Neural networks, decision trees, support vector machines.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Terminology
 Error: Incorrect labeling of the data by the classifier
 Cost of error: Cost of making a decision, in particular an incorrect one – not all errors are equally costly!
 Training Performance: The ability / performance of the classifier in correctly identifying the classes of the
training data, which it has already seen. It may not be a good indicator of the generalization performance.
 Generalization (Test Performance): The ability / performance of the classifier in identifying the classes of
previously unseen patterns.
 Confusion Matrix: The matrix obtained from test performance of the classifier that shows how many instances
of each class are classified into different classes.

       | c11  c12  …  c1K |
  CM = | c21  c22  …  c2K |
       |  ⋮    ⋮   ⋱   ⋮  |
       | cK1  cK2  …  cKK |

cij: Number of class ωi instances classified as class ωj by the classifier.
The diagonal entries cii count the correctly classified instances.
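A short MATLAB sketch of computing a confusion matrix (assumes the Statistics and Machine Learning Toolbox; the labels below are made up for illustration):

true_labels = [1 1 1 2 2 2 3 3 3 3];         % correct classes of the test data
pred_labels = [1 1 2 2 2 1 3 3 3 2];         % classes assigned by the classifier
CM = confusionmat(true_labels, pred_labels)  % CM(i,j): class i instances labeled as class j
accuracy = sum(diag(CM)) / sum(CM(:))        % fraction of correctly classified instances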
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Types of Classifiers
A selection

Supervised classifiers:
 Bayes classifier
 Naïve Bayes classifier
 K-nearest neighbor
 Discriminant classifiers
  Linear / quadratic discriminant classifier
 Neural networks
  Probabilistic neural network
  Multilayer perceptron neural network
  Radial basis function neural network
  Hopfield network
  Deep neural networks (convolutional neural nets, autoencoders, etc.)
 Kernel methods:
  Support vector machines
 Decision trees
  Classification and regression tree (CART)
  Random forest

Clustering algorithms:
 K-means
 Fuzzy C-means
 Hierarchical clustering
 Adaptive Resonance Theory (ART)
 Self organized maps (SOMs)
 ISODATA

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Cooperative vs.
Uncooperative data

(Figure: two scatter plots of exemplars / patterns / measurements from class 1, class 2, class 3 and class 4 in the Sensor 1 / Feature 1 vs. Sensor 2 / Feature 2 plane: one cooperative dataset in which the classes are well separated, and one uncooperative dataset in which the classes overlap.)

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Good Features vs.
Bad Features
 Ideally, for a given group of patterns coming from the same class, feature values should all
be similar
 For patterns coming from different classes, the feature values should be different.

G R. Gutierrez-Osuna, Lecture Notes, Texas A&M - http://research.cs.tamu.edu/prism/rgo.htm


Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
A Hypothetical Toy Example

 Taxonomic classification
  Given a sequence, to which genus does it belong?
  Let's say we find the following features to be potentially useful:
   • x1: # of genes per read
   • x2: average gene length per read
  Which one is the better feature?

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Features…?

Adapted from Duda, Hart & Stork, Pattern Classification, 2/e, Wiley, 2000

(Figure: histograms of feature x1 and of feature x2 for Genus 1 and Genus 2.)

• Which one is the better feature?
• If you were to make the decision based on the value of a single feature, which one would you choose, and what would the decision value be?
• No value of either feature will “classify” all trials correctly… What to do? D

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Decision Boundaries
 How to choose the decision boundary?

(Figure: Genus 1 and Genus 2 samples in the (x1, x2) feature space, shown with three candidate decision boundaries.)

Which of the boundaries would you choose?
 Simple linear boundary – training error > 0
 Nonlinear complex boundary – training error = 0
 Simpler nonlinear boundary – training error > 0 D
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Overfitting
 Overfitting: Using too complex of a model (with many adjustable parameters) to explain a
relatively simple underlying phenomenon
 Results in learning the noise in the data
y(x, w) = w0 + w1·x + w2·x^2 + … + wM·x^M = Σ_{i=0}^{M} wi·x^i

B
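An illustrative MATLAB sketch of overfitting with this polynomial model (synthetic data, not from the lecture): a high model order M drives the training error toward zero while the error on held-out points grows.

x_tr = linspace(0, 1, 10);  y_tr = sin(2*pi*x_tr) + 0.2*randn(1, 10);  % small, noisy training set
x_ts = linspace(0, 1, 100); y_ts = sin(2*pi*x_ts);                     % noise-free test points
for M = [1 3 9]
    w = polyfit(x_tr, y_tr, M);                       % fit an order-M polynomial
    tr_err = mean((polyval(w, x_tr) - y_tr).^2);      % training error
    ts_err = mean((polyval(w, x_ts) - y_ts).^2);      % generalization (test) error
    fprintf('M = %d: train MSE = %.3f, test MSE = %.3f\n', M, tr_err, ts_err)
end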
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Occam’s Razor

“Entities are not to be multiplied without necessity.”

William of Occam
(1288-1348)

Original in Latin: Entia non sunt multiplicanda praeter necessitatem


Also appearing as: Pluralitas non est ponenda sine neccesitate
Literally: Plurality should not be posited without necessity.
(This is a philosophical argument, and not a scientific one)
Error?
 A classifier, intuitively, is designed to minimize classification error, i.e., the total number of instances classified
incorrectly.
 Is this the best objective (cost) function to minimize?
 What kinds of error can be made? Are they all equally bad? What is the real cost of making an error?
• Perhaps in the genus prediction problem the cost of error is equal in both cases – what is the big deal, you may ask,
whether we misclassify a fox fossil as a wolf?
• But what if the classification problem were the genetic risk factor for breast cancer based on some genes?
 We may want to adjust our decision boundary to minimize overall risk – in this case, the second type of error is more costly, so we may want
to minimize this error.

(Figure: decision boundary in the (x1, x2) feature space, shifted to trade one type of error against the other.) D

Consider a similar risk assessment for malignant / benign tumor classification. Which error is more costly…?
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
A Complete
ML System

Data Acquisition → Preprocessing → Feature Extraction → Classification → Post-processing

 Preprocessing: Remove noise from the data.

 Feature Extraction: Identify and extract features relevant to distinguishing one class from another, but invariant to noise, occlusion and irrelevant transformations. What if we already have irrelevant features? Dimensionality reduction / feature selection?

 Classification: Determine the right type of classifier, the right type of training algorithm and the right classifier parameters.

 Post-processing: What is the cost of making a decision – what is the risk of an incorrect decision? How do we choose a classifier that minimizes the risk and maximizes the performance? Can we add prior knowledge or context to the classifier?

D
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Let’s Ponder…
 Let’s ponder on that question “Determine the right type of classifier” for a little bit…
 Of the many, many classification algorithms, which is the best one?
 If you are stranded on a desert island and can only bring one classification algorithm, what would that
be?

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Bayes Rule
 We pose the following question: Given that event A (e.g., the observation of some data) has occurred, what is the probability
that any single one of the events Bj occurs (i.e., that the correct class is one of the category choices)?

   P(Bj | A) = P(A ∩ Bj) / P(A) = P(A | Bj)·P(Bj) / Σ_{k=1}^{N} P(A | Bk)·P(Bk)

 This is known as the Bayes rule, and it is one of the most important cornerstones of machine learning.
(Rev. Thomas Bayes, 1702-1761)

 The denominator, a summation over all values of B, is just a normalization constant, ensuring
that all probabilities add up to 1.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Bayesian Way
of Thinking
 The classic war in statistics: Frequentists vs. Bayesians
 They cannot even agree on the meaning of probability
• Frequentist: the expected likelihood of an event A over a long run: P(A) = n/N.
• Bayesian: measure of plausibility of an event happening, given an observation providing incomplete data, and
previous (sometimes / possibly subjective) degree of belief (known as the prior, or a priori probability)
 Many phenomena of random nature can be explained by the frequentist definition of probability:
• The probability of hitting the jackpot in lottery;
• The probability that there will be at least 10 non-rainy days in Philadelphia in October;
• The probability that at least one of you is born in July;
• The probability that the sum of two random cards will be 21;
 …but some cannot!
• The probability that a major catastrophe will end life on Earth in the next 100 years ;
• The probability that the conflict in Syria will end by 2025;
• The probability that there will be another major recession in the next 10 years;
• The probability that fossil-based energy will be obsolete in 50 years.
• The probability that there will be another pandemic in the next 20 years
 Yet, you can make approximate estimations of such probabilities
 You are following a Bayesian way of thinking to do so
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Bayesian Way
of Thinking
 In Bayesian statistics, we compute the probability based on three pieces of information:
  Prior: Our (subjective?) degree of belief that the event is plausible in the first place.
  Likelihood: The probability of making an observation, under the condition that the event has occurred: how likely
is it to observe what I just observed, if event A did in fact happen (or, how likely is it to observe this outcome, if A [class
ωA] were true). Likelihood describes what kind of data we expect to see in each class.
  Evidence: The probability of making such an observation.

 It is the combination of these three that gives the probability of an event, given that an observation (however incomplete
the information it provides) has been made. The probability computed based on such an observation is then called the
posterior probability.
 Given the observation, Bayesian thinking updates the original belief (the prior) based on the likelihood and evidence.

   posterior = (likelihood × prior) / evidence

 Sometimes, the combination of evidence and likelihood is so compelling that it can overwrite our original belief.
  • Recall the fossil-based energy example: If asked in the early 1900s, such a phenomenon was not on anyone's radar  prior: very low
(near zero). In 2022, we now have alternate energy sources (solar, wind, etc.). Likelihood: P(no more fossil | solar energy):
very high. This high likelihood trumps our low prior  posterior: P(no more fossil | current solar technology): very high!
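A minimal numeric sketch of this update in MATLAB (the numbers are hypothetical, chosen only to show how a strong likelihood can overwrite a low prior):

prior      = [0.05 0.95];              % P(event), P(no event): a low initial degree of belief
likelihood = [0.90 0.10];              % P(observation | event), P(observation | no event)
evidence   = sum(likelihood .* prior); % total probability of making this observation
posterior  = (likelihood .* prior) / evidence   % updated beliefs; the two entries sum to 1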

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Bayes Classifier
 Statistically, this is the best classifier you can build!!!
 Based on quantifying the trade-offs between various classification decisions using a probabilistic
approach
 The theory assumes:
  The decision problem can be posed in probabilistic terms
  All relevant probability values are known or can be estimated (in practice this is not true)
 Back to our example: We want to predict the genus: Genus 1 (ω1) or Genus 2 (ω2)
  Assume that we know the probabilities of each genus, P(ω1) and P(ω2), for a particular geographical area
   • Prior probability
  Based on this information, how would you guess the next fossil DNA's genus?

   ω = ω1 if P(ω1) > P(ω2)          A reasonable
   ω = ω2 if P(ω2) > P(ω1)          decision rule?

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Setting this up
 We now make some observations: we take a read, and look at the number of times two specific k-mers occur in that
read: 14 and 6
  Random variables, say X1 (# of kmer1): x1 = 14; and X2 (# of kmer2): x2 = 6;   x = [x1, x2]ᵀ
  How to use this information?
   • Assume there are two possibilities: the organism is either Genus 1 → ω1 or Genus 2 → ω2
 Probabilistically, then

   ω = ω1 if P(ω = ω1 | x1, x2) > 0.5, else ω2        or        ω = ω1 if P(ω = ω1 | x) > P(ω = ω2 | x), else ω2

 So how do we compute P(ω1 | x) and P(ω2 | x)?
   • Posterior probability
 This can be set in the Bayesian framework, which computes the probability conditioned on one variable from the
probability conditioned on the other variable:

   P(ωj | x) = p(x | ωj)·P(ωj) / p(x) = p(x | ωj)·P(ωj) / Σ_{k=1}^{C} p(x | ωk)·P(ωk)

 But then, what is p(x | ω1)?
   • Likelihood

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Bayes Rule
In Pattern Recognition
 Suppose we know P(ω1), P(ω2), p(x|ω1) and p(x|ω2), and that we have observed the value of the feature (a random
variable) x, say the # of kmer1 occurrences, and it is 14…
  How would you decide on the “state of nature” – the genus – based on this information?
  Bayes theory allows us to compute the posterior probabilities from the prior and class-conditional probabilities

   P(ωj | x) = p(x | ωj)·P(ωj) / p(x) = p(x | ωj)·P(ωj) / Σ_{k=1}^{C} p(x | ωk)·P(ωk)

Likelihood, p(x | ωj): The (class-conditional) probability of observing a feature value of x, given that the correct class is ωj, or what kind of data we expect to see in class ωj. All things being equal, the category with the higher class-conditional probability is more “likely” to be the correct class.

Prior probability, P(ωj): The total probability of the correct class being ωj, determined based on prior experience (before an observation is made).

Posterior probability, P(ωj | x): The (conditional) probability of the correct class being ωj, given that feature value x has been observed. Based on the measurement (observation), the probability of the correct class being ωj has shifted from P(ωj) to P(ωj | x).

Evidence, p(x): The total probability of observing the feature value x. Serves as a normalizing constant, ensuring that the posterior probabilities add up to 1.

A Bayes classifier decides on the class ωj that has the largest posterior probability.
The Bayes classifier is statistically the best classifier one can possibly construct.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
How do we compute
Class Conditional Probabilities?
ω1: Genus 1  →  p(x|ω1): class-conditional probability (likelihood) for Genus 1
ω2: Genus 2  →  p(x|ω2): class-conditional probability (likelihood) for Genus 2

Likelihood: For example, given that Genus 2 (ω2) is observed, what is the probability that we had 12 kmer1 occurrences in a given read?
Or simply, what is the probability of seeing 12 kmer1 occurrences when the organism is Genus 2?
Or, how likely is it that a Genus 2 organism is associated with 12 kmer1 occurrences in any given read?

   P(ωj | x) = p(x | ωj)·P(ωj) / p(x) = p(x | ωj)·P(ωj) / Σ_{k=1}^{C} p(x | ωk)·P(ωk)

To find the likelihood, let's approximate the continuous-valued distribution with a histogram.
This is the kind of data we expect to see in class ω1 and class ω2.

(Figure: histograms of the # of kmer1 occurrences, with bins at 0, 3, 6, 9, 12, 15, 18, 21, 24, for class ω1 and class ω2.) D RP
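A short MATLAB sketch of this histogram approximation (synthetic k-mer counts; the data and bin edges are made up for illustration):

x1 = randi([0 18], 200, 1);     % # of kmer1 occurrences in reads from class w1 (hypothetical)
x2 = randi([6 24], 150, 1);     % # of kmer1 occurrences in reads from class w2 (hypothetical)
edges = 0:3:24;                 % histogram bins over the feature range
p_x_w1 = histcounts(x1, edges, 'Normalization', 'probability');  % estimate of p(x|w1)
p_x_w2 = histcounts(x2, edges, 'Normalization', 'probability');  % estimate of p(x|w2)
bar(edges(1:end-1), [p_x_w1; p_x_w2]', 'grouped')  % compare the two class-conditional estimates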
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Posterior Probabilities
 Bayes rule allows us to convert the likelihood to the posterior probability (difficult to determine directly) with the help of the prior
probabilities and the evidence (easier to determine).
 If in fact we observed 24 kmer1 occurrences, we can now ask: given that we observed 24 kmer1 occurrences, what is the
probability that this sequence comes from Genus 1? Or Genus 2? The answer is the posterior probability of these classes.
Of course, we choose the class with the larger posterior probability.

(Figure: posterior probabilities for priors P(ω1) = 2/3 and P(ω2) = 1/3, plotted over the feature range 0–24.
For example, given that a pattern is measured to have feature value x = 24, the probability that it is in category ω2 is roughly 0.08, and that it
is in ω1 is 0.92. At every x, the posteriors sum to 1.0.)

Which class would you choose now?

How good is your decision? What is your probability of making an error with this decision?

   P(error | x) = P(ω1 | x) if we decide on class ω2
                  P(ω2 | x) if we decide on class ω1
D
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Bayes Decision Rule

Choose ωi if P(ωi | x) > P(ωj | x) for all j = 1, 2, …, c, j ≠ i

If there are multiple features, x = {x1, x2, …, xd}:

Choose ωi if P(ωi | x) > P(ωj | x) for all j = 1, 2, …, c, j ≠ i

Choose the class that has the largest posterior probability!!!

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Are we done?
 Since we now know the best classifier that can be built, are we done? Can we go home…?
 Not quite: Bayes classifier cannot be used if we don’t know the prob. distributions
 This is typically the rule, not the exception
 In most applications of practical interest, we do not know the underlying distributions
 The distributions can be estimated, if there is sufficient data
 Sufficient??? Make that “a ton of data”, or better yet… “lots × 10^(tons of data)”
 Estimating the prior distribution is relatively easy; however, estimating the class-conditional distribution is
difficult, and it gets very difficult as the dimensionality increases… the curse of dimensionality
 If we know the form of the distribution, say normal, but not its parameters, say the mean and
variance, the problem reduces from distribution estimation to parameter estimation, a
much easier problem.
 If not, there are nonparametric density estimation techniques.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


The Real World:
The World of Many Dimensions
 In most practical applications, we have more than one feature, and therefore the random variable
x must be replaced with a random vector x:   p(x) → p(x) = p(x1, x2, …, xd)
 The joint probability distribution p(x) still satisfies the axioms of probability
 The Bayes rule is then

   P(ωj | x) = p(x | ωj)·P(ωj) / p(x) = p(x | ωj)·P(ωj) / Σ_{k=1}^{C} p(x | ωk)·P(ωk)

   where p(x) = p(x1, x2, …, xd) and p(x | ωj) = p(x1, x2, …, xd | ωj)

 If – and only if – the random variables in the vector are statistically independent:

   p(x) = p(x1)·p(x2)⋯p(xd) = Π_{i=1}^{d} p(xi)

 While the notation changes only slightly, the implications are quite substantial:
  The curse of dimensionality

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


The Curse of
Dimensionality
Remember: In order to approximate the distribution, we need to create a histogram.

1-D: On average, let's say we need 30 instances in each of the 20 bins to adequately populate the histogram → 20 × 30 = 600 observations
2-D: 20 × 20 × 30 = 12,000 observations!
3-D: 20 × 20 × 20 × 30 = 240,000 observations!
RP
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Well…That Sucks!
 Yup! While the Bayes classifier is the best classifier, in most practical applications, you cannot use it.
 Why? Because it requires a lot of data from each feature – something you may not have
 That is why there are scores of other classifiers out there for you to choose from – each is generally good for certain
applications and not so optimal for others.
 So, we wasted a perfectly good hour on something we cannot use???
 Not quite.
 We will come back to Bayes Classifier in the second half.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ



X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, et al., “Top 10 Algorithms in Data Mining,” Knowledge Information Systems, vol. 14, pp. 1-37, 2008.
* C4.5 is listed as one of the top 10 in Wu et al. paper. Dr. Polikar disagrees with this, as C4.5 is a variant of CART. The MLP is a far more
deserving classifier to be in the top 10. Also, note that J. Quinlan, the creator of C4.5, is one of the authors of this paper.
K-Nearest Neighbor
Given a set of labeled training points, a test instance should be given
the label that appears most abundantly in its surrounding.

(Figure: a 2-D feature space, Sensor 1 Measurements (feature 1) vs. Sensor 2 Measurements (feature 2), with training points from class 1, class 2, class 3 and class 4; a test instance is labeled by the majority class among its k = 11 nearest neighbors.)
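A minimal MATLAB sketch of this rule (assumes the Statistics and Machine Learning Toolbox; the two-class Gaussian data below is synthetic and purely illustrative):

Xtrain = [randn(50, 2); randn(50, 2) + 3];            % one instance per row, two features
ytrain = [ones(50, 1); 2*ones(50, 1)];                % class labels 1 and 2
knn    = fitcknn(Xtrain, ytrain, 'NumNeighbors', 11); % k = 11, as in the figure
ypred  = predict(knn, [0 0; 3 3])                     % each query takes the majority label of its 11 nearest training points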

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Naïve Bayes
Given an observation x, the correct class ω is the
one that maximizes the posterior probability P(ω|x).
The posterior is computed using the Bayes rule from
• the prior probability P(ω), the probability of class ω occurring in general, and
• the likelihood p(x|ω), the probability of the observed specific value of x occurring in class ω:

   P(ωj | x) ∝ p(x | ωj)·P(ωj)

For x ∈ ℝᵈ, p(x | ωj) is a d-dimensional joint probability that is difficult to compute for large d.
However, if the features are conditionally independent, i.e.,

   p(x | ωj) = Π_{i=1}^{d} p(xi | ωj) = p(x1 | ωj)·p(x2 | ωj)⋯p(xd | ωj)

then p(x | ωj) is just a product of one-dimensional individual likelihoods,
which is much easier to compute. This is the “naïve” assumption made by NB.
If the distribution form is also assumed (usually Gaussian), then NB
is very easy to implement.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Naïve Bayes Classifier
(Known Distribution type)
 So given some training data how do we implement the Naïve Bayes classifier?
 If you can assume that the class conditional features follow a specific distribution, e.g., the Gaussian
distribution, then implement the following pseudocode.
Let X be the training data, each instance having d features x1, x2, …, xd
for j = 1,…,c
  for i = 1,…,d
    Compute the mean and variance of feature xi over all instances from class ωj:
      μij = (1/Nij) Σ_{x∈ωj} xi ,    σij² = (1/Nij) Σ_{x∈ωj} (xi − μij)²              (1)

Let z = {z1, z2, …, zd} be the test data to be classified. For each z:
for j = 1,…,c
  for i = 1,…,d
    Compute the class-conditional likelihood of each feature for each class:
      p(zi | ωj) = (1 / √(2π σij²)) · exp( −(zi − μij)² / (2 σij²) )                  (2)
  Compute the posterior distribution (up to a constant), assuming class-conditional independence of the features:
      gjNB(z) = P(ωj) · Π_{i=1}^{d} p(zi | ωj) ∝ P(ωj | z)                            (3)
Choose the class for which the posterior distribution / discriminant is the largest:
      ω* = argmax_j gjNB(z) = argmax_j P(ωj | z)                                      (4)

 If the likelihoods follow a different, non-Gaussian (but of known type, say chi-square) distribution,
compute the parameters of that distribution for (1) and use the definition of that distribution for (2).
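A sketch of the Gaussian version of this pseudocode as a MATLAB function (illustrative only; fitcnb in the Statistics and Machine Learning Toolbox provides a ready-made equivalent):

% X: N-by-d training data, y: N-by-1 labels, Z: M-by-d test data (hypothetical names)
function labels = naive_bayes_gauss(X, y, Z)
    classes = unique(y);  c = numel(classes);  M = size(Z, 1);
    g = zeros(M, c);                              % discriminants g_j^NB(z)
    for j = 1:c
        Xj    = X(y == classes(j), :);            % training instances of class w_j
        mu    = mean(Xj, 1);                      % per-feature means      (eq. 1)
        sig2  = var(Xj, 1, 1);                    % per-feature variances  (eq. 1)
        prior = size(Xj, 1) / size(X, 1);         % P(w_j) from class frequencies
        lik   = exp(-(Z - mu).^2 ./ (2*sig2)) ./ sqrt(2*pi*sig2);  % eq. (2), one column per feature
        g(:, j) = prior * prod(lik, 2);           % eq. (3): product over features
    end
    [~, idx] = max(g, [], 2);                     % eq. (4): largest discriminant wins
    labels = classes(idx);
end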
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Naïve Bayes Classifier
(Unknown Distribution type)
 If the form of the distribution is not known, then it must be estimated. Use either a density estimation technique (such as
Parzen windows, k-nearest neighbor, etc.), or simply follow the histogram approach shown earlier (slide 46):
 For each class, look at the minimum and maximum values of each feature
 Divide that range into a meaningful number of bins, based on the
number of training data (typically, a minimum of 10)
• Optimize the number of bins vs. the number of instances in each bin
• The larger the number of bins, the finer the resolution of the estimate
• The larger the number of instances in each bin, the more accurate the estimate
• For a given dataset size, you can only maximize one or the other.
 Create the histogram by counting the number of instances that fall into each bin
 The histogram is the estimate of the class-conditional likelihood of that feature.
 Follow steps (3) and (4) of the pseudo code in the previous slide.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


K-Means
Instances that belong to the same class / cluster should “look alike”, i.e., be located within proximity of each other.
 K-means iteratively partitions the data into k clusters, each centered around
its cluster center, in such a way that the within-cluster distance (the sum of
distances of all instances to their cluster center), summed over all clusters,
is minimized.

M K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


K-Means

(Figure: K-means iterations on 2-D data with k = 2.)

1. Random initial choice of the two means
2. Cluster the data according to the nearest mean
3. Recalculate the means
   Change? Yes →
2. Cluster the data according to the nearest mean
3. Recalculate the means
   …
Continue at home…
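The iteration above is what MATLAB's kmeans carries out; a minimal call (assumes the Statistics and Machine Learning Toolbox; the 2-D data below is synthetic):

data = [randn(100, 2); randn(100, 2) + 4];   % two loose groups of 2-D points, one per row
[idx, centers] = kmeans(data, 2);            % assign every point to one of k = 2 clusters
gscatter(data(:,1), data(:,2), idx); hold on
plot(centers(:,1), centers(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2)   % final cluster means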

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Classification &
Regression Trees
CART creates a “decision tree”, basically a hierarchical
organization of IF-THEN rules, based on the values of each feature.
 The tree starts at the root node, which represents the first question (rule) to be answered by the
tree. Usually, the root is associated with the most important attribute (feature).
 The root is connected to other internal nodes with directional links, called branches, that
represent the values of the attributes for the node. Each decision made at a node splits the
data through the branches. A leaf (or terminal) node is one where no further split occurs, and
each is associated with a category label, i.e., a class.
 CART progressively evaluates all features, determines the most informative one, and the critical
value to split on, that provides the best classification. Information theory / entropy-based criteria are
used for this purpose. An example of “graphical methods.”

D
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ

A Starter Guide for Self-study


Decision Trees
 Decision trees constitute a fundamentally different paradigm for classification, with the following
major differences from other classifiers:

1. Unlike just about every other classification algorithm we have seen so far, decision trees are not based on a
similarity or distance measure
2. Decision trees can handle nominal (categorical), as well as ordinal (ordered) and cardinal (numeric) data,
while most other classifiers require such data to be transformed to ordinal data.
3. The decisions made by a decision tree are intuitive, as they represent the information in a hierarchical structure
that can be traced. Hence, it is easy to determine the reason for a specific decision by tracing that decision
to specific values of the features. Unlike most other classifiers, decision trees are NOT black boxes.
 Decision trees are members of the more general class of models known as Graphical Models, as
they are described using graph theory and graph terminology.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


A Multi-Class Example
What Fruit Is This?

Categorical and discrete data: the feature color takes the values green, yellow and red.

The correlation between the class decisions and the feature values is not based on a similarity measure.

Traceable decisions: If it is yellow, round and small, it is a lemon! D
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Decision Trees
 Decision trees allow us to deal with data that do NOT describe the patterns by vectors of continuously
valued real numbers, but rather by lists of attributes that can be related to a class (hence the name
attribute being used for features).
 Note that such a list allows us not to be bothered by similarity measures: for the feature color, red is no
closer to yellow than to blue.
 So long as there is no duplicate data with opposing class labels, a decision tree can always be built that
will have zero training error, that is, the decision tree can memorize the data
 Hence small changes in attribute values can cause very different classifications. While this is normally not a
desirable property of a classifier, such instability makes decision trees the perfect base classifier for ensemble
systems!

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Components of a
Decision Tree
 A decision tree is a hierarchical series of rules (hence a rule-based classifier), organized as an inverted tree.
As a model, decision trees consists of the following:
 Root node , or just the root, represents the first question (rule) to be answered by the tree. Usually, the root is associated
with the most important attribute (feature).
 The root node is connected to other internal nodes, called descendants, with directional links, called branches, that
represent the values of the attributes for the node. Each descendant (child) node is connected to one parent node. Each
decision made at a node splits the data through the branches.
 The part of the tree that follows a node is called a subtree.
 A node that does not have any descendants is called a leaf node (or terminal node). Each leaf node is associated with
one of the category labels, i.e., classes.
 The links must be mutually distinct and exhaustive: there is always a path that leads to a leaf node (a class) regardless of the
feature (attribute) values, and that path is unique.
 During classification, each internal node is a test on the data instance on one (or more) of the attributes. The tests of a tree
are mutually exclusive and exhaustive (see above). Each branch then corresponds to the outcome of the test at its node.
The final result of the classification is the class label of the leaf node reached through path specified by the successive tests
for the given data instance.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Interpretability
 Clearly, by far the most important advantage of trees
over other classifiers is interpretability.
 The decision for any pattern is the conjunction of decisions
along the path that leads to its leaf node:
 For a tree with attributes {color, size, shape and taste},
the pattern given by
x ={red, medium, round, sweet} will be classified as apple,
so will x ={red, medium, thin, sour}.
D
 We can also get logical descriptions of classes using conjunctions and disjunctions : An apple is (green
AND medium) OR (red AND medium); Apple can also be described as (medium AND NOT yellow).
 Note that decision trees can naturally use prior knowledge available, though this is often easiest for small
datasets. For large datasets, as we will see, the tree will be created automatically using some tree creation
algorithms, such as CART, ID3, etc.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


CART
Classification and Regression Trees

 CART1 is a general framework for constructing decision trees.


 The general approach to constructing a tree typically uses the following steps. Given a training dataset D with
correct labels:
1. Find the most informative attribute and the corresponding threshold / critical values to split the parent node.
At first, the parent node is the root node.
2. Find the next most informative attributes, for each value of the parent node, along with their critical values to
split the current node.
• Each such split divides the training data into smaller and smaller subsets.
3. If all remaining data points at the end of the current split are of the same label, then that subset is said to be
pure. The node becomes a leaf node with the associated label.
• If not, either stop splitting and assign a label to the leaf node, accepting an imperfect decision, or
• Select another attribute and continue splitting the data (and growing the tree).
 Hence, tree creation is a recursive process: given the data represented at each node, we either declare that
node to be a leaf node with an associated label, or find another attribute to split the data into further subsets.
1Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


An Example
decisiontreedemo.m Fisher Iris Dataset

load fisheriris
ctree = ClassificationTree.fit(meas, species, 'classnames', {'setosa', 'versicolor', 'virginica'}, 'predictornames', {'SL', 'SW', 'PL', 'PW'});
view(ctree)
view(ctree,'mode','graph')

Decision tree for classification


1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
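ClassificationTree.fit is the older interface; in current MATLAB releases the same tree is typically built with fitctree (a minimal sketch on the same Fisher iris data):

load fisheriris
ctree = fitctree(meas, species, 'PredictorNames', {'SL', 'SW', 'PL', 'PW'});
view(ctree, 'Mode', 'graph')   % same graphical view of the IF-THEN splits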

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Matlab Apps

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Classification learner
% Example 1: 2-D Gaussian data (generate_multiple_gauss2d and create_griddata2
% are the instructor's helper functions, not built-in MATLAB)
MU = [-1 2 3; 1 3 0];
Sigma1 = [.9 .4; .4 .3];
Sigma2 = [1 0; 0 1];
Sigma3 = [.3 0.8; 0.8 3];
SIGMA{1} = Sigma1; SIGMA{2} = Sigma2; SIGMA{3} = Sigma3;
X = [-5:0.1:5];
Y = [-8:0.1:8];
COUNT = [200 1000 500];
[Gaussian tr_data class tr_labels] = generate_multiple_gauss2d(MU, SIGMA, COUNT);
tr_data = tr_data';
ts_data = create_griddata2(-2.5:0.05:4, -4:0.05:5);
ts_data = ts_data';

% Example 2: rotated checkerboard data (generate_checkerboard is a workshop helper function)
N = 5000;        % Number of data points
a = 0.2;         % length of each square
alpha = pi/3;    % angle of rotation
[tr_data, tr_labels] = generate_checkerboard(N, a, alpha);
tr_data = tr_data';
tr_labels = full(ind2vec(tr_labels'));
tr_labels = vec2ind(tr_labels);
[ts_data, ts_labels] = generate_checkerboard(100000, a, alpha);
ts_data = ts_data';
ts_labels = full(ind2vec(ts_labels'));
ts_labels = vec2ind(ts_labels);

% Example 3: optical digits data (the .mat files are provided with the workshop)
load opt_train
load opt_class
load opt_test
load opttest_class
opt_class = vec2ind(opt_class); opttest_class = vec2ind(opttest_class);

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Classification learner

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Multilayer Perceptron (MLP)
Mimics the structure of a physiological neural network, a massively
interconnected web of neurons, where each neuron performs a relatively simple
function.
 Each neuron (node) computes a weighted sum of its inputs, and then passes that
sum through a nonlinear thresholding function. The neuron “fires” (or not) based
on the output of the thresholding function.
 The optimal weights are determined using a gradient descent optimization.

(Figures: the MLP architecture with d input nodes x1…xd, H hidden layer nodes yj (weights Wij) and c output nodes z1…zc (weights Wjk), with i = 1,2,…,d, j = 1,2,…,H, k = 1,2,…,c; and the criterion function J(w) being minimized by gradient descent with learning rates η1, η2.)

   yj = f(netj) = f( Σ_{i=1}^{d} wji·xi )        zk = f(netk) = f( Σ_{j=1}^{H} wkj·yj )

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Neural Networks
(A More Modern Definition)

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Modeling the Neuron:
The Perceptron

(Figure: a biological neuron with dendrites, cell body (soma), axon and synaptic terminals, alongside the perceptron model.) RP

Perceptron: inputs x = [x1, …, xd]ᵀ with weights w = [w1, …, wd], plus a bias input x0 = 1 with weight w0.
∑: sum of weighted inputs;  f: nonlinear activation function

   net = Σ_{i=0}^{d} wi·xi = x·wᵀ + w0
   y = f(net)

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Neural Networks
Physiological Origins

(Figure: a feedforward network with d input nodes x1…xd, H hidden layer nodes yj (weights Wij) and c output nodes z1…zc (weights Wjk); i = 1,2,…,d, j = 1,2,…,H, k = 1,2,…,c.) RP

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


The ANN Training Cycle
Stage 1: Network Training – present examples and indicate the desired outputs to an ANN whose weights are to be determined; the training algorithm determines the synaptic weights (the network's “knowledge”).

Stage 2: Network Testing – present new data to the trained network and read off the predicted outputs.

RP

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


The Multilayer Perceptron
Architecture
• What truly separates an MLP from a regular simple perceptron is the
non-linear threshold function f, also known as the activation function. If a
linear thresholding function is used, the MLP can be replaced with a series
of simple perceptrons, which can then only solve linearly separable
problems.

(Figure: d input nodes, H hidden layer nodes and c output nodes, with weights Wji between the input and hidden layers and Wkj between the hidden and output layers; i = 1,2,…,d, j = 1,2,…,H, k = 1,2,…,c.) RP

   yj = f(netj) = f( Σ_{i=1}^{d} wji·xi )
   zk = f(netk) = f( Σ_{j=1}^{H} wkj·yj )
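A bare-bones MATLAB sketch of this forward pass (random weights, purely illustrative):

d = 4; H = 10; c = 3;                  % input, hidden and output layer sizes
x  = randn(d, 1);                      % one input pattern
W1 = randn(H, d);  W2 = randn(c, H);   % hidden-layer and output-layer weights
f  = @(net) 1 ./ (1 + exp(-net));      % sigmoid activation
y  = f(W1 * x);                        % hidden outputs: y_j = f( sum_i w_ji * x_i )
z  = f(W2 * y)                         % network outputs: z_k = f( sum_j w_kj * y_j )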
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Activation Functions
Threshold (Logic Unit):   f(net) = 1 if net ≥ 0, 0 otherwise     (or the bipolar version: f(net) = 1 if net ≥ 0, −1 otherwise)
Linear / Identity:        f(net) = net
Sigmoid:                  f(net) = 1 / (1 + e^(−β·net))          (shown for β = 0.1, 0.25, 0.5, 0.9)
ReLU*:                    f(net) = net if net > 0, 0 otherwise

(Figure: plots of the four activation functions f(net) over net ∈ [−15, 15].) RP

* Rectified linear unit
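The same four activation functions written as MATLAB anonymous functions (a small sketch; β and the plotting range are taken from the figure):

net  = linspace(-15, 15, 301);
step = @(net) double(net >= 0);                 % threshold (logic unit)
lin  = @(net) net;                              % linear / identity
sig  = @(net, beta) 1 ./ (1 + exp(-beta*net));  % sigmoid
relu = @(net) max(net, 0);                      % rectified linear unit
plot(net, step(net), net, lin(net), net, sig(net, 0.5), net, relu(net))
legend('threshold', 'linear', 'sigmoid (\beta = 0.5)', 'ReLU')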

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Computing the Weights
The Gradient Descent
 The secret sauce of a neural network is the set of weights, w
 So, how do we find the weights?
  We define a criterion function, J(w), and minimize it such that w is the solution vector
  This reduces the problem of a massive search into a problem of minimizing a scalar function

   w* = argmin_w J(w)

 How do we minimize J(w)? … you ask…
  Start at some arbitrary point w1, and compute the corresponding J(w1)
  Compute the gradient (the what…?) of J(w1) with respect to w1  ∇J(w1)
  Obtain the next point w2 by moving in the direction of the negative gradient, −∇J(w1), by some
amount η (the learning rate).
  Repeat until no change
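A minimal MATLAB gradient-descent sketch on a simple quadratic criterion J(w) = ‖w − w_target‖² (an illustrative stand-in, not the MLP criterion itself):

w_target = [2; -1];
J     = @(w) sum((w - w_target).^2);    % criterion function J(w)
gradJ = @(w) 2*(w - w_target);          % its gradient with respect to w
w   = [0; 0];                           % start at some arbitrary point w1
eta = 0.1;                              % learning rate
for k = 1:100
    g = gradJ(w);                       % gradient at the current point
    w = w - eta*g;                      % move in the direction of the negative gradient
    if norm(eta*g) < 1e-6, break; end   % repeat until (almost) no change
end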

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Gradient Descent
De-Mystified!
   w_{k+1} = w_k − η_k·∇J(w_k)      (take a scaled step along the negative gradient)
   Δwi = −η·∂J/∂wi, ∀i;    wi ← wi + Δwi

   ∇J(w) = [ ∂J/∂w1, ∂J/∂w2, …, ∂J/∂w_{d+1} ]

(Figure: J(w) plotted against w; successive points w1, w2, w3 move downhill along −∇J(w1), −∇J(w2) with learning rates η1, η2.) RP

Basic Gradient Descent
   Initialize w, threshold θ, η(k), k = 0
   do  k ← k + 1
       w ← w − η(k)·∇J(w)
   until |η(k)·∇J(w)| < θ
   return w
   end
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Some Issues to Consider…

 How to choose the learning rate η?
  Too small η  slow convergence
  Too large η  overshoot, no convergence (!)

 What should be the criterion function?
  A bad selection of J(⋅) can ruin an otherwise perfectly good algorithm!

 Local / global minimum?

 When to terminate to avoid overfitting?

 There are many forms of gradient descent that address these issues: Newton’s descent,
the momentum term, conjugate gradient algorithm, stochastic gradient descent, etc.,
etc., etc.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Look Ma…
Not A Single Line of Code

You need to have the training / test data in the


workspace to use this tool! You can use standard
datasets, or your own dataset.

But this app has very limited functionality.


Use Classification Learner, or better yet, command
line options.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Built-in Datasets

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


MLP Demo
trainmlp_menu_based.m

(Figures: the spiral dataset and a randomly generated Gaussian dataset (plus a third 2-D dataset), each shown together with the MLP-based classification of the corresponding test data.)
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Command Line / Scripted Training
 You get a lot more fine-control on network parameters by using the command line functions

% create the network
net = feedforwardnet([30], 'traingdx');
net.layers{1}.transferFcn = 'tansig';
net.layers{2}.transferFcn = 'tansig';

if nargin == 4
    net.divideparam.trainRatio = TR_ratio;   % TR data ratio set by user
    net.divideparam.valRatio   = V_ratio;    % Validation data ratio set by user
    net.divideparam.testRatio  = T_ratio;    % Test data ratio set by user
end

net.trainParam.epochs = 5000;           % Maximum number of epochs to train
net.trainParam.goal = 0.01;             % Performance goal
net.trainParam.lr = 0.01;               % Learning rate
net.trainParam.lr_inc = 1.05;           % Ratio to increase learning rate
net.trainParam.lr_dec = 0.7;            % Ratio to decrease learning rate
net.trainParam.max_fail = 50;           % Maximum validation failures
net.trainParam.max_perf_inc = 1.04;     % Maximum performance increase
net.trainParam.mc = 0.9;                % Momentum constant
net.trainParam.min_grad = 1e-10;        % Minimum performance gradient
net.trainParam.show = 50;               % Epochs between displays (NaN for no displays)
net.trainParam.showCommandLine = 0;     % Generate command-line output
net.trainParam.showWindow = 1;          % Show training GUI
net.trainParam.time = inf;              % Maximum time to train in seconds

% train the network
[net, train_record, net_outputs, net_errors] = train(net, tr_data, tr_labels);

% Simulate the network
Network_output = net(ts_data);
Performance = perform(net, ts_labels, Network_output);

% Compute confusion matrix for text display (confusion_matrix is a helper function, not built-in MATLAB)
[Conf_mat Ratio_mat test_perf] = confusion_matrix(Network_output, ts_labels);
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Support Vector Machines
In a two-class linear classification problem, the best decision boundary is the one that maximizes the margin between the class boundaries. SVM uses quadratic programming to find this optimal boundary.

(Figure: maximum-margin boundary; from Theodoridis & Koutroumbas, Pattern Recognition, 4/e, Academic Press.)

 This may not be too terribly useful, since most problems are not linear.

 But the SVM's biggest feat is its transformation, the so-called “kernel trick,” that allows a non-linear problem to be solved in a high-dimensional linear space, without doing any high-dimensional calculations!!!
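A minimal MATLAB sketch of an SVM with an RBF kernel (assumes the Statistics and Machine Learning Toolbox; the two-class data below is synthetic and illustrative):

X = [randn(100, 2); randn(100, 2) + 2.5];    % one instance per row, two features
y = [ones(100, 1); -ones(100, 1)];           % labels +1 / -1
svmmdl = fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 'auto');
nsv    = sum(svmmdl.IsSupportVector)         % training points that define the margin
ypred  = predict(svmmdl, [0 0; 2.5 2.5])     % classify two query points with the kernel-trick boundary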

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ



A Starter Guide for Self-study


Intuitive Idea
 Given many boundaries that can all solve a given linearly separable problem, which one is the best
– i.e., most likely to result in the smallest generalization error?
 Intuitively, our answer is: the one that provides the largest separation between the classes: this leaves more
room for noisy samples to move around
 The SVM is a classifier that can find such an optimal hyperplane, which provides the maximum margin
between the nearest instances of opposing classes.

[Figure: several viable decision boundaries versus the optimal decision boundary; the support vectors define the maximum margin between the classes. Axes: Feature 1 vs. Feature 2.]
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Maximizing Margins

[Figure: illustration of margin maximization; figure credit: TK.]

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Recall: Geometry of a
Decision Boundary
Any point x can be written in terms of its projection xp onto the hyperplane g(x) = wᵀx + w0 = 0 and its distance r to that hyperplane:

    x = xp + r·(w/‖w‖)

    g(x) = wᵀx + w0 = wᵀ(xp + r·w/‖w‖) + w0 = wᵀxp + w0 + r·(wᵀw/‖w‖) = r‖w‖     (since g(xp) = 0)

    ⇔  r = g(x)/‖w‖

Let's take a look at the value of the function g(x) at the origin, x = 0:

    r0‖w‖ = g(x)|x=0 = wᵀx + w0 = w0   ⇒   r0 = w0/‖w‖

 w0 determines the location of the hyperplane; w determines its direction (orientation).
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Formalizing the Problem
 Recall that the separating hyperplane (decision boundary) is given by
(using b instead of w0, as commonly done in SVM literature)
    g(x) = wᵀx + w0 = wᵀx + b

 Given two-class training data xi with labels yi = +1 and yi = −1. Then,

    wᵀx + b ≥ 1   ⇒  x ∈ ω1
    wᵀx + b = 0   ⇒  x is on the boundary
    wᵀx + b ≤ −1  ⇒  x ∈ ω2

    yi(wᵀxi + b) ≥ 1, ∀i     (correct classification of all points)

 The distance r* of a point x on the margin (where g(x) = 1) to the hyperplane g(x) = 0 is

    r* = g(x)/‖w‖ = 1/‖w‖

 Then the length of the margin is m = 2/‖w‖.

[Figure: the boundary wᵀx + b = 0 and the margin hyperplanes wᵀx + b = ±1, with classes ω1 and ω2 on either side of the margin m, in the Feature 1 – Feature 2 plane.]
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Constrained
Optimization
 The best hyperplane – the one that provides the maximum separation – is therefore the one that maximizes m = 2/‖w‖.
 However, by arbitrarily choosing w we can make its length as small as we want. In fact, w = 0 would provide an infinite margin – this is clearly not an interesting – or even viable – solution. There has to be a constraint on this problem.
 The constraint comes from the correct classification of all data points, which requires that yi(wᵀxi + b) ≥ 1, ∀i.
 Therefore, the problem of finding the optimal decision boundary is converted into the following constrained optimization problem:

    min (1/2)‖w‖²     subject to  yi(wᵀxi + b) ≥ 1, ∀i

 Note that maximizing m = 2/‖w‖ is equivalent to minimizing ‖w‖²/2.
 We take the square of the length of the vector, which (along with the ½ factor) does not change the solution, but makes the solution process easier.
 Among other things, since the function to be minimized is quadratic, it has only a single (global) minimum.
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
A Primer on
Constrained Optimization
 Recall from our previous discussion that constrained optimization can be solved through Lagrange multipliers:
 If we wish to find the extremum of a function f(x) subject to some constraint g(x) = 0, the extremum point x can be found by:
1. Form the Lagrange function to convert the problem to an unconstrained problem, where α – whose value needs to be determined – is the Lagrange multiplier:

    L(x, α) = f(x) + α·g(x)

2. Solve the resulting unconstrained problem by taking the derivative:

    ∂L(x, α)/∂x = ∂/∂x (f(x) + α·g(x)) = ∂f(x)/∂x + α·∂g(x)/∂x

3. For a point x* to be a solution it must then satisfy:

    ∂L(x, α)/∂x |x=x* = 0,     g(x*) = 0
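As a quick worked example (added here for illustration, not part of the original slides): minimize $f(x_1,x_2) = x_1^2 + x_2^2$ subject to $g(x_1,x_2) = x_1 + x_2 - 1 = 0$.

$$L(x,\alpha) = x_1^2 + x_2^2 + \alpha\,(x_1 + x_2 - 1), \qquad \frac{\partial L}{\partial x_1} = 2x_1 + \alpha = 0, \qquad \frac{\partial L}{\partial x_2} = 2x_2 + \alpha = 0$$

These give $x_1 = x_2 = -\alpha/2$; substituting into $g(x^*) = 0$ yields $x_1 = x_2 = 1/2$ and $\alpha = -1$, so the constrained minimum lies at $(1/2,\,1/2)$.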

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Constrained Optimization
with Multiple Constraints
 If we have many constraints, gi(x) = 0, i = 1, …, n, we need a Lagrange multiplier αi for each constraint, which then appear as a summation in the Lagrangian:

    L(x, αi) = f(x) + Σi αi·gi(x)

and we require that the gradient of the Lagrangian be zero:

    ∂L(x, αi)/∂x |x=x* = ∂/∂x [ f(x) + Σi αi·gi(x) ] |x=x* = 0,     gi(x*) = 0,  i = 1, …, n

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


The Dual Problem
The primal problem

    min (1/2)‖w‖²     subject to  yi(wᵀxi + b) ≥ 1, ∀i

has the dual

    max LD = −(1/2) Σi Σj αi αj yi yj xiᵀxj + Σi αi,     subject to  αi ≥ 0,  Σi αi yi = 0

 A few points worth mentioning:
 This is called the Wolfe dual representation. Wolfe proved that the w (and b) that minimize L are the same that maximize the dual Lagrangian LD with respect to the αi.
 The dual problem does not depend on w or b. Hence, it is easier to solve. Even the n constraints are (fewer and) easier. Once the αi are obtained, w can be found from w = Σi αi yi xi, and b can be found from any support vector xi on the margin satisfying yi(wᵀxi + b) = 1.
 The original (primal) problem increases its complexity (scales) as dimensionality increases, as there is one additional parameter in w for each dimension. However, the dual problem is independent of w, and hence of dimension. It only scales with the number of data points – there is one Lagrange multiplier for each data point.
 Most importantly, the training data appear only as dot products xiᵀxj in the dual formulation. As we will see later, this is going to have a profound effect when we deal with nonlinear problems (kernel trick).
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Support Vectors
 Once again, let's remember the KKT conditions:

    αi (1 − yi(wᵀxi + b)) = 0,     1 − yi(wᵀxi + b) ≤ 0,     αi ≥ 0

 The first equation states that either αi = 0, or 1 − yi(wᵀxi + b) is zero (or both).
 Therefore, if 1 − yi(wᵀxi + b) is not zero – that is, if xi is not on the margin – then the corresponding Lagrange multiplier must necessarily be zero!
 For those xi that do lie on the margin hyperplanes, αi > 0, in which case those points define the hyperplane, and hence are called support vectors.
 It is possible for both conditions to be satisfied at zero, that is, αi = 0 for some points that do lie on the margin. These points are not considered support vectors, since they are not required to define the hyperplane.
Hence, we could replace the entire dataset with the few support vectors we find by solving the optimization problem. Only the support vectors matter for determining the optimal hyperplane; the rest of the data points might as well be thrown away.
[Figure: the boundary wᵀx + b = 0 and the margin hyperplanes wᵀx + b = ±1; the points lying on the margin are the support vectors.]
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
How to Solve…?
Quadratic Programming

 So far, we have mentioned that we simply solve the quadratic Lagrangian


along with the constraints. We have obtained a simpler (dual) form with
simpler constraints, but exactly how is minimization obtained?
 This is a quadratic programming (QP) problem which is solved through
iterative (and numerical) optimization techniques
 There are many alternatives, see http://www.numerical.rl.ac.uk/qp/qp.html
 Typically, such techniques are interior-point algorithms, which start with an initial solution that violates the constraints.
• Iteratively improve the solution by reducing the amount of constraint violation
 A popular technique is sequential minimal optimization (SMO), whose details are beyond the scope of this class. However, it is basically based on the following:
• A QP problem with only two variables can be analytically solved (even by hand).
• SMO picks a pair of αi αj at each iteration and solves QP with these two variables. It then
repeats until convergence, i.e., all KKT conditions / constraints are met.
 Matlab’s optimization toolbox has a quadprog() function
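As an illustrative sketch (not from the workshop files), the dual problem can be handed to quadprog() directly. The two-Gaussian data, the variable names, the box constraint C, and the support-vector tolerance below are all assumptions made for this example (adding C turns it into the soft-margin dual, which keeps the problem bounded even if a few points overlap):

% Hypothetical 2-D data: one row per sample, labels in {-1,+1}
X = [randn(50,2)+2; randn(50,2)-2];
y = [ones(50,1); -ones(50,1)];
n = size(X,1);
C = 10;                                  % soft-margin box constraint (illustrative value)

% Dual QP:  min_a (1/2) a'*H*a - sum(a)   s.t.  y'*a = 0,  0 <= a <= C
H = (y*y') .* (X*X') + 1e-10*eye(n);     % H(i,j) = y_i y_j x_i' x_j (tiny jitter for stability)
f = -ones(n,1);
alpha = quadprog(H, f, [], [], y', 0, zeros(n,1), C*ones(n,1));

sv        = alpha > 1e-5;                % support vectors: (numerically) nonzero multipliers
margin_sv = sv & (alpha < C - 1e-5);     % on-margin support vectors, used to compute b
w = X(sv,:)' * (alpha(sv).*y(sv));       % w = sum_i alpha_i y_i x_i
b = mean(y(margin_sv) - X(margin_sv,:)*w);   % from y_i (w'x_i + b) = 1 on the margin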
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
How About
Nonlinear Case…?
 Note that SVM is essentially a linear classifier. Even with the slack variables,
it still finds a linear boundary between the classes.
 What if the problem is fundamentally a non-linearly separable problem?
 Perhaps one of the most dramatic twists in pattern recognition allows the
modest linear SVM classifiers to turn into one of the strongest nonlinear
classifiers.
 Cover’s theorem: A complex problem that is not linearly separable in the given
input space is more likely to be linearly separable in a higher dimensional space
• Input space: the space in which the given training data points xi reside
• Feature space: A higher dimensional space, obtained by transforming xi through a (kernel)
transformation function φ(xi)
 Hence, SVMs solve a nonlinear problem by:
• Performing a nonlinear mapping from the input space to the higher-dimensional feature space – a mapping that is hidden from both the input and the output;
• Constructing an optimal (linear) hyperplane in the high-dimensional space.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


An Example

[Figure: the XOR problem. (a) The data points in the original input space (Feature 1 vs. Feature 2) are not linearly separable. (b) After the mapping φ(·), the transformed points in the feature space are linearly separable. (c)–(d) For x = [x1 x2]ᵀ, the mapping

    φ1(x) = exp(−‖x − t1‖²),  t1 = [1 1]ᵀ        φ2(x) = exp(−‖x − t2‖²),  t2 = [0 0]ᵀ

takes the four corner points (0,0), (0,1), (1,0), (1,1) of the x1–x2 plane into the (φ1, φ2) plane, where the two classes become linearly separable.]
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Another Example

[Figure: original example by Schölkopf & Smola; figure from Gutierrez.]


Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Curse of Dimensionality…?
 An immediate concern that comes to mind is the difficulty of the solving an
optimization problem in a higher dimensional space.
 In fact, the higher dimensional space can even be of infinite dimension.
 How does one compute anything in infinite dimensional space?
 SVM pulls out its final and most effective trick ever: the kernel trick
 Recall that the data points in the Lagrangian appeared in dot (inner) products only
 So long as we can compute the inner product in the higher-dimensional space efficiently, we do not need to compute the mapping – or any high-dimensional computation, for that matter:

    max LD = −(1/2) Σi Σj αi αj yi yj xiᵀxj + Σi αi,     subject to  0 ≤ αi < C,  Σi αi yi = 0

 Many geometric operators, such as angles and distances, can be expressed in terms of inner products. Then the trick is simply to find a kernel function K such that

    K(xi, xj) = φ(xi)ᵀ φ(xj)
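To make the kernel trick concrete, here is a small numeric check (illustrative only, not part of the original slides): for the degree-2 polynomial kernel K(x,z) = (1 + xᵀz)² in 2-D, an explicit mapping is φ(x) = [1, √2·x1, √2·x2, x1², x2², √2·x1·x2]ᵀ, and the kernel value computed in 2-D equals the inner product in that 6-D space:

x = [0.3; -1.2];  z = [2.0; 0.5];        % two arbitrary 2-D points
phi = @(u) [1; sqrt(2)*u(1); sqrt(2)*u(2); u(1)^2; u(2)^2; sqrt(2)*u(1)*u(2)];
K_direct = (1 + x'*z)^2;                  % kernel computed in the 2-D input space
K_mapped = phi(x)' * phi(z);              % same value via the explicit 6-D mapping
fprintf('%.6f vs %.6f\n', K_direct, K_mapped);   % the two numbers agree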

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Available Kernels

 Polynomial kernel with degree d:  K(xi, xj) = (1 + γ·xiᵀxj)ᵈ
 The user defines the value of d (and γ), which then controls how large the feature space dimensionality will be. As seen earlier, a choice of d = 2 moves a 2-D x into a 6-D z. Similarly, using d = 3 moves a 2-D x into a 10-D z.
 Radial basis (Gaussian) kernel with width σ:  K(xi, xj) = exp(−‖xi − xj‖² / 2σ²)
 The user defines the kernel width σ. This SVM is closely related to the RBF network. It increases the dimensionality to ∞, as every data point is replaced by a continuous Gaussian. The number of RBFs and their centers are determined based on the (number of) support vectors and their values, respectively.
 Hyperbolic tangent (sigmoid – two-layer MLP):  K(xi, xj) = tanh(κ·xiᵀxj + θ)
 The parameters κ and θ are user defined. The number of hidden layer nodes and their values are determined based on the (number of) support vectors and their values, respectively. Then, the hidden–output weights are the Lagrange multipliers αi. This kernel satisfies Mercer's conditions only for certain values of κ and θ.
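For reference, a minimal sketch of training kernel SVMs with the Statistics and Machine Learning Toolbox; the data is illustrative, and the name–value pairs shown are the toolbox's standard options ('KernelScale' plays the role of the kernel width):

% Hypothetical two-class data: one row per sample
X = [randn(100,2)+1.5; randn(100,2)-1.5];
y = [ones(100,1); -ones(100,1)];

svm_rbf  = fitcsvm(X, y, 'KernelFunction','rbf',        'KernelScale', 1);      % Gaussian kernel
svm_poly = fitcsvm(X, y, 'KernelFunction','polynomial', 'PolynomialOrder', 2);  % degree-2 polynomial

yhat = predict(svm_rbf, X);
fprintf('Training accuracy (RBF kernel): %.2f%%\n', 100*mean(yhat == y));
nSV  = sum(svm_rbf.IsSupportVector);      % number of support vectors found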

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


trainsvm_menu_based.m
SVM Demo
[Demo figures: SVM-based classification of two-class test data on two example datasets; each plot marks the training samples, the classified test samples, and the support vectors found by the SVM.]
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Expectation – Maximization
Gaussian Mixture Models
An extremely versatile optimization algorithm, EM is an iterative approach that cycles between the expectation (E) and maximization (M) steps to find the estimates of a statistical model.
 Designed for parameter estimation (determining the values of the unknown parameters θ of a model), and commonly used in conjunction with other algorithms, such as k-means, Gaussian Mixture Models (GMMs), hierarchical mixtures of experts, or in missing-data analysis.
 In the E-step, the expected value of a likelihood function – the figure of merit in determining the true value of the unknown parameters – is computed under the current estimate θ̂ of the unknown parameters θ (that are to be estimated).
 In the M-step, a new estimate θ̂ is computed such that this new estimate maximizes the current expected likelihood. The E and M steps are then iterated until convergence.
 In GMMs, data are modeled using a weighted combination of Gaussians, and EM is used to determine the Gaussian parameters, as well as the mixing coefficients (the mixing weights).
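A minimal sketch of fitting a GMM via EM with the Statistics and Machine Learning Toolbox (the two-component data below is purely illustrative; fitgmdist runs the EM iterations internally):

% Hypothetical 1-D data drawn from two Gaussians
data = [randn(200,1)*0.5 + 1; randn(300,1)*1.0 + 5];

gm = fitgmdist(data, 2);          % EM estimates the means, variances and mixing weights
disp(gm.mu);                      % estimated component means
disp(gm.ComponentProportion);     % estimated mixing weights
post = posterior(gm, data);       % E-step quantities: each sample's component responsibilities
idx  = cluster(gm, data);         % hard assignment of each sample to a component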

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


A Priori
Evaluate a dataset of lists to determine which items appear together, i.e., learn the associations among the items in the lists.
 A priori is a breadth-first search using hash-based steps to quickly search large datasets.
 It is an iterative search: start with single items whose frequency of occurrence exceeds a threshold, called the minimum support. Then form all pairs built from those single items (the candidate lists), and scan the dataset to determine the pairs whose frequency of occurrence exceeds the threshold. Continue with triplets, quadruplets, etc., as in the sketch below.
 The fundamental premise: any item or list of items whose frequency of occurrence falls below the threshold cannot be part of any frequent superset. This is how A priori limits (prunes) the search space.
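A minimal sketch of the first two passes over a (hypothetical) transaction matrix, just to illustrate support counting and candidate pruning; this is not an optimized implementation:

% Rows = transactions, columns = items (logical membership matrix); hypothetical data
T = logical([1 1 0 1; 1 1 1 0; 0 1 1 1; 1 1 0 1; 1 0 1 1]);
minSupport = 0.6;                        % minimum fraction of transactions
nT = size(T,1);

% Pass 1: frequent single items
supp1     = sum(T,1) / nT;               % support of each item
frequent1 = find(supp1 >= minSupport);

% Pass 2: candidate pairs are built only from the frequent single items
pairs = nchoosek(frequent1, 2);
keep  = false(size(pairs,1),1);
for k = 1:size(pairs,1)
    suppPair = mean(T(:,pairs(k,1)) & T(:,pairs(k,2)));   % fraction containing both items
    keep(k)  = suppPair >= minSupport;
end
frequentPairs = pairs(keep,:);           % continue analogously with triplets, etc.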

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


PageRank™
The importance of a webpage is proportional to the links that point to it from other web pages, as well as the importance of those web pages. A page P that receives links from many webpages gets a higher PageRank. If those links are coming from pages with high PageRank themselves, then P receives an even higher PageRank.
 This is the original algorithm used by Google for ranking its search results.
Currently, it is only part of the (undisclosed) algorithm used
by Google.
 PageRank is named after its inventor Larry Page. The fact that it is a
“page ranking” algorithm is a convenient coincidence.

See US Patent #6,285,999


http://www.google.com/patents?vid=6285999

http://en.wikipedia.org/wiki/Pagerank
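As an illustration of the underlying idea (a simplified sketch, not Google's actual implementation), PageRank can be computed by power iteration on a small hypothetical link graph; the damping factor 0.85 is the value commonly quoted in the literature:

% Adjacency matrix of a tiny hypothetical web: A(i,j) = 1 if page i links to page j
A = [0 1 1 0;
     0 0 1 0;
     1 0 0 1;
     0 0 1 0];
n = size(A,1);
d = 0.85;                                  % damping factor

M = A ./ sum(A,2);                         % row-stochastic: split each page's vote evenly
r = ones(n,1)/n;                           % start with uniform ranks
for it = 1:100
    r = (1-d)/n + d * (M' * r);            % each page collects rank from the pages linking to it
end
disp(r')                                   % pages with more (and better) inlinks rank higher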
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
AdaBoost
Combine the decisions of an ensemble of classifiers to reduce the likelihood of having chosen a poorly trained classifier.
 Conceptually similar to seeking several opinions before making an important
decision.
 Based on the premise that there is increased confidence that a decision agreed
upon by many (experts, reviewers, doctors, “classifiers”) is usually correct.
 AdaBoost generates an ensemble of classifiers using a given “base model,”
which can be any supervised classifier. The accuracy of the ensemble, based
on weighted majority voting of its member classifiers, is usually higher than
that of a single classifier of that type.
 The weaker the base classifier (the poorer its performance), the greater the
impact of AdaBoost.
 AdaBoost trains the ensemble members on different subsets of the training data.
Each additional classifier is trained with data that is biased towards those instances that were misclassified by the previous classifier → the ensemble focuses on increasingly difficult-to-learn samples.
 AdaBoost turns a dumb classifier into a smart one!
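A minimal sketch using the Statistics and Machine Learning Toolbox, boosting decision stumps (a deliberately weak base learner) with AdaBoostM1; the data and parameter values are illustrative:

% Hypothetical two-class data: one row per sample
X = [randn(150,2)+1; randn(150,2)-1];
y = [ones(150,1); -ones(150,1)];

stump = templateTree('MaxNumSplits', 1);            % weak base model: a decision stump
ens   = fitcensemble(X, y, 'Method', 'AdaBoostM1', ...
                     'Learners', stump, 'NumLearningCycles', 100);

cvens = crossval(ens, 'KFold', 5);                  % 5-fold cross-validated ensemble
fprintf('CV error: %.3f\n', kfoldLoss(cvens));      % typically lower than a single stump's error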

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Yes…yes…but,
Which is the Best?
 Technically, the Bayes classifier (not the Naïve Bayes classifier) is statistically the best classifier.
 However, it is impractical to compute it in most cases
 The performance of any given classifier depends very much on the data on which it was trained.

 There are several difficulties in measuring the true performance of any given classifier, let alone
determining which one is best among many others.
 In fact, there is even a theorem, called the No Free Lunch Theorem*, which states that – everything else being equal – no algorithm is universally better than the others across all possible datasets.

See https://en.wikipedia.org/wiki/No_free_lunch_theorem and then https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization and then


http://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Evaluating
Classifier Performance
 How do we determine the true performance of the designed classifier out in the field?
 This is not as easy as one may think.
 Standard procedure: Calculate empirical performance on the previously unseen test data (also called counting
estimator)
    Error = N_misclassified / N_total          Performance = N_correct / N_total

 Note that this is just an estimate, since we do not know the performance on the field data.
 Assume that the errors have a binomial distribution with parameters (P_D, N_total), i.e., the classifier commits an error with probability P_D = N_errors / N_total. If N_total > 30, P_D·N_total > 5, and (1 − P_D)·N_total > 5, the binomial distribution can be approximated by a Gaussian distribution.

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Calculating
Classifier Performance
 Then, we can calculate a confidence interval for this probability of error:
Confidence Level (%) 99.73 99 98 96 95.45 95 90 80 68.27 50

Critical Value zα/2 3.00 2.58 2.33 2.05 2.00 1.96 1.645 1.28 1.00 0.675

    P_D ± z_(α/2)·σ_(P_D)  ≈  P_D ± z_(α/2)·√( P_D(1 − P_D) / N_total )
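A quick numeric sketch of this interval in Matlab (the counts are hypothetical; norminv gives the critical value for any confidence level):

N_total  = 500;                      % hypothetical number of test instances
N_errors = 45;                       % hypothetical number of misclassified instances
P_D      = N_errors / N_total;       % estimated error probability

conf  = 0.95;
z     = norminv(1 - (1-conf)/2);     % z_(alpha/2), about 1.96 for 95% confidence
halfw = z * sqrt(P_D*(1-P_D)/N_total);
fprintf('Error = %.3f, %g%% CI = [%.3f, %.3f]\n', P_D, 100*conf, P_D-halfw, P_D+halfw);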

 But the question remains: What if the test data we picked consisted of unusually easy or unusually
difficult instances? Wouldn’t the performance be different if we had a different test set?
 There are two better ways to estimate the true performance:
 Shuffle the training and testing datasets several times, making sure that they do not overlap. Calculate a generalization performance for each, and take the average of the performances.
 K-fold cross validation: this is considered one of the better estimators:

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Cross-Validation
[Figure: the entire dataset is divided into K blocks. In each of the K folds, one block (the test block, shown in yellow) is held out for testing and the remaining K−1 blocks (shown in blue) are used for training.]


Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Cross-Validation
 Divide the entire dataset D, consisting of N instances, into K disjoint (mutually exclusive) sets of equal size, N/K.
 Call each of these K datasets a hold-out set.
 Train the classifier K times, with the entire dataset minus one of the hold-out sets.
 Test each classifier on its own hold-out set. This gives K error rates, or K generalization performances.
 Compute a confidence interval based on the z or t test (depending on the number of blocks) using the K generalization performances, where p̄ is the average performance from the K trials and s is the sample standard deviation of the performances:

    P_true = ( p̄ − t_(α/2, K−1)·s/√K ,  p̄ + t_(α/2, K−1)·s/√K ) = p̄ ± t_(α/2, K−1)·s/√K

 The mean of the K generalization performances is considered a good estimate of the true generalization performance of the classifier.
 In the limiting case of N = K, cross validation is called the leave-one-out procedure.
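A minimal K-fold sketch with cvpartition (the data is illustrative, and any classifier could stand in for the SVM used here):

% Hypothetical two-class data: one row per sample
X = [randn(100,2)+1; randn(100,2)-1];
y = [ones(100,1); -ones(100,1)];

K  = 10;
cv = cvpartition(y, 'KFold', K);
perf = zeros(K,1);
for k = 1:K
    mdl     = fitcsvm(X(training(cv,k),:), y(training(cv,k)));   % train on K-1 blocks
    yhat    = predict(mdl, X(test(cv,k),:));                     % test on the held-out block
    perf(k) = mean(yhat == y(test(cv,k)));                       % generalization performance
end
pbar = mean(perf);  s = std(perf);
ci   = pbar + tinv([0.025 0.975], K-1) * s/sqrt(K);              % 95% t-interval
fprintf('Mean accuracy %.3f, 95%% CI [%.3f, %.3f]\n', pbar, ci(1), ci(2));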

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Figures of Merit
 Classification accuracy, as measured by the ratio of correct classifications, is not the only (nor – in some applications – the most desirable) metric. For two-class problems, in particular, we can look at a variety of metrics.

    True class ↓ \ Predicted →      Class 1: ω+              Class 2: ω−
    ω+                              TP: True positive        FN: False negative
    ω−                              FP: False positive       TN: True negative

• Accuracy = # of correct classifications / # of instances = (TP + TN) / N
• Error rate = # of errors / # of instances = (FP + FN) / N
• Recall (Sensitivity, Hit rate, TPR) = # of true positives / # of total positives = TP / (TP + FN)
(how many of the actual positive class instances are detected by the classifier as positive?)
• Precision (PPV) = # of true positives / # of all positives found = TP / (TP + FP)
(of all instances identified as positive, how many are really positive?)
• Specificity (True Negative Rate) = TN / (TN + FP)
(how many of the actual negative class instances are detected by the classifier as negative?)
• False alarm (positive) rate = FP / (FP + TN) = 1 − Specificity
• F1 score = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
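A small sketch computing these figures of merit directly from (hypothetical) true and predicted label vectors, with +1 as the positive class:

% Hypothetical true and predicted labels
y_true = [ 1  1  1  1 -1 -1 -1 -1 -1 -1]';
y_pred = [ 1  1 -1  1 -1 -1  1 -1 -1 -1]';

TP = sum(y_pred== 1 & y_true== 1);   FN = sum(y_pred==-1 & y_true== 1);
FP = sum(y_pred== 1 & y_true==-1);   TN = sum(y_pred==-1 & y_true==-1);
N  = numel(y_true);

accuracy    = (TP+TN)/N;
recall      = TP/(TP+FN);            % sensitivity / TPR
precision   = TP/(TP+FP);            % PPV
specificity = TN/(TN+FP);
F1          = 2*TP/(2*TP+FP+FN);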
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Receiver Operating Characteristic
(ROC) Curve

 Most classification algorithms can be configured to provide a continuous output that can be interpreted as the level of support given by the classifier for each class.
 Sometimes, these supports can also be interpreted as the posterior probability of class ωc, given the observed data: p(ωc|x). For a two-class problem, we have p(ω+|x) = 1 − p(ω−|x).
 So, how do we make the actual classification decision?
 We can certainly choose ω+ if p(ω+|x) > p(ω−|x), and ω− otherwise. This is equivalent to setting a decision threshold of θ = 0.5, since the decision will be ω+ if p(ω+|x) > θ, and ω− otherwise.
 Now, imagine we vary that threshold between 0 and 1.
 If we set θ = 0, everything will be classified as ω+. Since no instance will be labeled ω−, we will have zero false negatives, but also no true negatives. This results in 100% recall (sensitivity) – as all of the actual positives are indeed identified as positive – but 0% specificity.
 Now, if we set θ = 1, then everything will be classified as ω−. There will be no false positives, but also no true positives. This results in 100% specificity (since all of the actual negatives are indeed classified as negative), but 0% sensitivity.
 So, by varying θ between 0 and 1, we can control whether we want our classifier to be more cautious against false positives (at the risk of reduced true positives) or more cautious against false negatives (at the risk of reduced true negatives).

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Receiver Operating Characteristic
(ROC) Curve

 Sweeping θ between 0 and 1, and plotting the resulting values of the true positive rate (i.e., sensitivity or recall) against the false positive rate (i.e., 1 − specificity), we get a curve called the receiver operating characteristic (ROC) curve.
 The upper left corner represents perfect classification: 100% TPR and 0% FPR.
 Classifiers are usually not perfect, of course, so depending on the value of θ, we obtain a series of points which, when connected, form the ROC curve. Usually, we get a different curve for each classifier.
 The diagonal line is the line of random guessing, as there are equal numbers of true and false positives. A classifier whose ROC curve is close to the diagonal is a bad classifier. The closer the curve is to the upper left corner, the better the classifier (C is better than A). Note that points symmetric around the random-guess line are equally good (or bad), as the decision can always be flipped.
[Figure: ROC space with example classifiers; modified by RP from http://en.wikipedia.org/wiki/File:ROC_space-2.png]
 To quantify the ROC-based comparison, we typically use the area under the ROC curve (AUC). The classifier with the higher AUC is the better classifier.
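A minimal sketch of drawing an ROC curve and computing the AUC from classifier scores with perfcurve (the labels and scores below are illustrative):

% Hypothetical labels and continuous classifier scores (support for the positive class)
labels = [ones(50,1); zeros(50,1)];
scores = [0.3 + 0.4*rand(50,1); 0.1 + 0.4*rand(50,1)];     % positives tend to score higher

[fpr, tpr, thresholds, AUC] = perfcurve(labels, scores, 1);  % positive class = 1
plot(fpr, tpr);  xlabel('False positive rate');  ylabel('True positive rate');
title(sprintf('ROC curve, AUC = %.3f', AUC));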
Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ
Comparison Across
Multiple Datasets
 When two approaches are compared on multiple datasets, the previous tests are no longer meaningful. One of the more commonly used methods for comparing two approaches on multiple datasets is Wilcoxon's signed-rank test.
 This test does not care about the actual performances of the classifiers, but rather about the ranks of the absolute differences in performance across the different datasets.
 We simply look at the (+) and (−) differences in performance, rank their absolute values, sum the ranks of the (+) and (−) differences separately, and compare these sums to a critical value (hence the name signed-rank test).

 Furthermore, because the test does not assume a particular underlying distribution (Gaussian, t-distribution, χ², etc.), it is a non-parametric test, which is usually more robust to outliers.
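A quick sketch using Matlab's signrank for a paired comparison of two classifiers across several datasets (the accuracy values below are hypothetical):

% Hypothetical accuracies of two classifiers on the same 8 datasets
accA = [0.91 0.84 0.77 0.88 0.93 0.81 0.69 0.90];
accB = [0.89 0.80 0.78 0.85 0.90 0.79 0.70 0.86];

[p, h] = signrank(accA, accB);     % Wilcoxon signed-rank test on the paired differences
fprintf('p-value = %.3f (h = %d: 1 rejects equal medians at the 5%% level)\n', p, h);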

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Built-in Datasets

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Where can I get data?
Checker Board Data
Matlab code for those who like it:

function [d, labd] = gendatcb(N, a, alpha)
% N data points, uniform distribution,
% checkerboard with side a, rotated by alpha
d = rand(N,2);
d_transformed = [d(:,1)*cos(alpha) - d(:,2)*sin(alpha), ...
                 d(:,1)*sin(alpha) + d(:,2)*cos(alpha)];
s = ceil(d_transformed(:,1)/a) + floor(d_transformed(:,2)/a);
labd = 2 - mod(s,2);

[Figure: the resulting two-class checkerboard data on the unit square.]
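A short usage sketch for generating and visualizing the checkerboard data (the parameter values are just an example):

[d, labd] = gendatcb(5000, 0.25, pi/6);          % 5000 points, side 0.25, rotated 30 degrees
gscatter(d(:,1), d(:,2), labd);  axis square;    % plot the two classes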

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ


Where Can I
Find more Data
 University of California at Irvine maintains a large collection of databases that have been donated over
the years.
 As of 15 August 2022 (today!), there are 622 datasets in the repository, of various types, sizes and properties.
 The repository can be accessed at http://archive.ics.uci.edu/ml/

Fundamentals of Machine Learning © 2022 - Robi Polikar, Rowan University, Glassboro, NJ

