
Presentation on

Foundations of Machine Learning

Overview of Course
1. Introduction
2. Linear Regression and Decision Trees
3. Instance-based Learning and Feature Selection
4. Probability and Bayes Learning
5. Support Vector Machines
6. Neural Network
7. Introduction to Computational Learning
Theory
8. Clustering
Introduction
1. Introduction
2. Different types of learning
3. Hypothesis space, Inductive Bias
4. Evaluation, Training and test set, cross-
validation
• Ever since computers were invented, we have wondered whether
they might be made to learn. If we could understand how to
program them to learn, that is, to improve automatically with
experience, the impact would be dramatic.
• Imagine computers learning from medical records which treatments
are most effective for new diseases.
• A successful understanding of how to make computers learn would
open up many new uses of computers and new levels of competence
and customization, and a detailed understanding of information-
processing algorithms for machine learning might lead to a better
understanding of human learning abilities (and disabilities) as well.
Machine Learning History
• 1950s:
– Samuel's checker-playing program
• 1960s:
– Neural network: Rosenblatt's perceptron
– Minsky & Papert prove limitations of Perceptron
• 1970s:
– Symbolic concept induction
– Expert systems and knowledge acquisition
bottleneck
– Quinlan's ID3
– Natural language processing (symbolic)
• 1980s:
– Advanced decision tree and rule learning
– Learning and planning and problem solving
– Resurgence of neural network
– Valiant's PAC learning theory
– Focus on experimental methodology
• 1990s: ML and Statistics
– Data Mining
– Adaptive agents and web applications
– Text learning
– Reinforcement learning
– Bayes Net learning
• 1994: Self-driving car road test
• 1997: Deep Blue beats Garry Kasparov
• Popularity of this field in recent time and the
reasons behind that
– New software/ algorithms
• Neural networks
• Deep learning
– New hardware
• GPUs
– Cloud Enabled
– Availability of Big Data
Programs vs learning algorithms

Algorithmic solution:
    Data + Program → Computer → Output

Machine Learning solution:
    Data + Output → Computer → Program
Machine Learning : Definition
• Learning is the ability to improve one's
behaviour based on experience.
• Build computer systems that automatically
improve with experience
• What are the fundamental laws that govern all
learning processes?
• Machine Learning explores algorithms that can
– learn from data / build a model from data
– use the model for prediction, decision making
or solving some tasks
Machine Learning : Definition
• A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E. [Mitchell]
Components of a learning problem
• Task: The behaviour or task being improved.
– For example: classification, acting in an
environment
• Data: The experiences that are being used to
improve performance in the task.
• Measure of improvement:
– For example: increasing accuracy in prediction,
acquiring new skills, improved speed and efficiency
Black-box Learner
• Inputs: Experiences / Data, a Problem / Task, and
Background knowledge / Bias.
• Output: an Answer / Performance.

Learner
• The Learner builds Models from the Experiences / Data and the
Background knowledge / Bias.
• A Reasoner uses the Models to address the Problem / Task and
produce the Answer / Performance.
Many domains and applications
Medicine:
• Diagnose a disease
– Input: symptoms, lab measurements, test
results, DNA tests,
– Output: one of set of possible diseases, or
“none of the above”
• Data: historical medical records
• Learn: which future patients will respond best to
which treatments
Vision:
• say what objects appear in an image
• convert hand-written digits to characters 0..9
• detect where objects appear in an image
Robot control:
• Design autonomous mobile robots that learn
from experience to
– Play soccer
– Navigate from their own experience
NLP:
• detect where entities are mentioned in NL
• detect what facts are expressed in NL
• detect if a product/movie review is positive,
negative, or neutral
Financial:
• predict if a stock will rise or fall
• predict if a user will click on an ad or not
Speech recognition
Machine translation
Application in Business Intelligence
• Forecasting product sales quantities taking
seasonality and trend into account.
• Identifying cross selling promotional
opportunities for consumer goods.
•…
Some other applications
• Fraud detection: credit card providers
• Determine whether or not someone will
default on a home mortgage.
• Understand consumer sentiment based on
unstructured text data.
• Forecasting women's conviction rates based
on external macroeconomic factors.
Broad types of machine learning
• Supervised Learning
– X, Y (pre-classified training examples)
– Given an observation x, what is the best label for y?
• Unsupervised Learning
– X
– Given a set of x's, cluster or summarize them
• Semi-supervised Learning
• Reinforcement Learning
– Determine what to do based on rewards and punishments.
Supervised Learning
Given:
– a set of input features X1, …, Xn
– a target feature Y
– a set of training examples where the values for the input
features and the target feature are given for each example
– a new example, where only the values for the input features
are given
Predict the value of the target feature for the new
example.
– classification when Y is discrete
– regression when Y is continuous
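A minimal sketch of this setup using scikit-learn; the toy data and the
particular model choices are illustrative assumptions, not part of the slides:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LinearRegression

    # Training examples: input feature values X with known target values.
    X = np.array([[20.0], [14.8], [27.5], [12.5]])
    y_class = np.array(["high", "low", "high", "low"])  # discrete Y -> classification
    y_reg = np.array([89.5, 79.9, 126.3, 56.9])         # continuous Y -> regression

    clf = DecisionTreeClassifier().fit(X, y_class)
    reg = LinearRegression().fit(X, y_reg)

    x_new = np.array([[18.0]])    # new example: only input features are given
    print(clf.predict(x_new))     # predicted class label
    print(reg.predict(x_new))     # predicted continuous value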
Classification
• Example: Credit scoring
Differentiating between low-risk and high-risk
customers from their income and savings
Regression
• Example: Price of a used car
x: car attributes
y: price
y = g(x, θ)
where g(·) is the model and θ its parameters
Features
• Often, the individual observations are analyzed
into a set of quantifiable properties called
features. Features may be
– categorical (e.g. "A", "B", "AB" or "O", for
blood type)
– ordinal (e.g. "large", "medium" or "small")
– integer-valued (e.g. the number of words in a
text)
– real-valued (e.g. height)
Classification learning
• Task T:
– input: a set of instances d1,…,dn
• an instance has a set of features
• we can represent an instance as a vector d = <x1,…,xn>
– output: a set of predictions y1,…,yn
• one of a fixed set of constant values:
– {+1,-1} or {cancer, healthy}, or {rose, hibiscus, jasmine, …}, or …
• Performance metric P:
• Experience E:
Classification Learning
Representations
1. Decision Tree
2. Linear function
3. Multivariate linear function
4. Single layer perceptron
5. Multi-layer neural network
Inductive Learning or Prediction
• Given examples of a function (X, F(X))
– Predict function F(X) for new examples X
• Classification
F(X) = Discrete
• Regression
F(X) = Continuous
• Probability estimation
F(X) = Probability(X)
Definitions
• Features: Properties that describe each
instance in a quantitative manner.
• Feature vector: n-dimensional vector of
features that represent some object
Terminology
Hypothesis Space
• The space of all hypotheses that can, in
principle, be output by a learning algorithm.
• One way to think about a supervised learning
machine is as a device that explores a
“hypothesis space”.
– Each setting of the parameters in the machine
is a different hypothesis about the function
that maps input vectors to output vectors.
Various Terminology
• Features: The number of features or distinct
traits that can be used to describe each item
in a quantitative manner.
• Feature vector: n-dimensional vector of
numerical features that represent some object
• Instance Space X: Set of all possible objects
describable by features.
• Concept c: Subset of objects from X (c is
unknown).
• Target Function f: Maps each instance x ∈ X to
a target label y ∈ Y
• Example (x,y): Instance x with label y = f(x).
• Training Data S: Collection of examples observed
by the learning algorithm. Used to discover
potentially predictive relationships.
Classifier
• Hypothesis h: Function that approximates f.
• Hypothesis Space ℋ: Set of functions we
allow for approximating f.
• The set of hypotheses that can be produced
can be restricted further by specifying a
language bias.
• Input: Training set 𝒮 ⊆ X
• Output: A hypothesis h ∈ ℋ
Hypothesis Spaces
• If there are N binary input features, there are 2^(2^N)
possible Boolean functions; for N = 4 that is 2^16 = 65,536.
• We cannot figure out which one is correct
unless we see every possible input-output pair (all 2^N of them).
Important issues in Machine Learning
• What are good hypothesis spaces?
• Algorithms that work with the hypothesis spaces
• How to optimize accuracy over future data
points (overfitting)
• How can we have confidence in the result?
(How much training data is needed? A statistical question.)
• Are some learning problems computationally
intractable?
Generalization
• Components of generalization error
– Bias: how much does the average model over all
training sets differ from the true model?
• Error due to inaccurate
assumptions/simplifications made by the model
– Variance: how much do models estimated from
different training sets differ from each other?
Underfitting and Overfitting
• Underfitting: Model is too “simple” to represent
all the relevant class characteristics
– High bias and low variance
– High training error and high test error
• Overfitting: Model is too “complex” and fits
irrelevant characteristics (noise) in the data
– Low bias and high variance
– Low training error and high test error
Experimental Evaluation of Learning Algorithms
• Evaluating the performance of learning systems is important
because:
– Learning systems are usually designed to predict the class of
“future” unlabeled data points.
• Typical choices for Performance Evaluation:
– Error
– Accuracy
– Precision/Recall
• Typical choices for Sampling Methods:
– Train/Test Sets
– K-Fold Cross-validation
Evaluating predictions
• Suppose we want to make a prediction of a
value for a target feature on example x:
– y is the observed value of the target feature on
example x.
– ŷ is the predicted value of the target feature
on example x.
– How is the error measured?
Measures of error
Confusion Matrix
Sample Error and True Error
Why Errors
• Errors in learning are caused by:
– Limited representation (representation bias)
– Limited search (search bias)
– Limited data (variance)
– Limited features (noise)
Difficulties in evaluating hypotheses
with limited data
• Bias in the estimate: The sample error is a poor estimator
of true error
– ==> test the hypothesis on an independent test set
• We divide the examples into:
– Training examples that are used to train the learner
– Test examples that are used to evaluate the learner
• Variance in the estimate: The smaller the test set, the
greater the expected variance.
Validation set
• Validation fails to use all the available data

k-fold cross-validation
1. Split the data into k equal subsets
2. Perform k rounds of learning; on each round
– 1/k of the data is held out as a test set and
– the remaining examples are used as training data.
3. Compute the average test set score of the k rounds
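A minimal sketch of k-fold cross-validation as described above; the data
layout, the helper name, and the scoring callback are illustrative assumptions:

    import numpy as np

    def k_fold_cv(X, y, train_and_score, k=5, seed=0):
        """Split the data into k equal subsets; each round holds one out."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        folds = np.array_split(idx, k)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_score(X[train], y[train], X[test], y[test]))
        return np.mean(scores)  # average test-set score of the k rounds

Here train_and_score(X_tr, y_tr, X_te, y_te) is any routine that fits a model
on the training split and returns its score (e.g. accuracy) on the test split.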
Linear Regression and Decision Tree
Regression
• In regression the output is continuous
• Many models could be used – the simplest is linear
regression
– Fit the data with the best hyper-plane which "goes
through" the points
[Plot: y, the dependent variable (output), against x, the
independent variable (input)]
A Simple Example: Fitting a Polynomial
• The green curve is the true
function (which is not a
polynomial)
• We may use a loss function that
measures the squared error in
the prediction of y(x) from x.
[Figures from Bishop's book on Machine Learning]
Some fits to the data: which is best?
Types of Regression Models

Regression models split by the number of features:
• Simple (1 feature): Linear or Non-Linear
• Multiple (2+ features): Linear or Non-Linear
Linear regression
• Given an input x compute an
output y
• For example:
– Predict height from age
– Predict house price from
house area
– Predict distance from wall
from sensors
Simple Linear Regression Equation
E(y) = β0 + β1 x
where β0 is the intercept and β1 the slope of the
regression line.
Linear Regression Model
• Relationship between variables is a linear function:
Y = β0 + β1 X + ε
where β0 is the population Y-intercept, β1 the population
slope, and ε the random error.
Sample of 15 houses from the region:

House Number | Y: Actual Selling Price | X: House Size (100s ft2)
      1      |          89.5           |          20.0
      2      |          79.9           |          14.8
      3      |          83.1           |          20.5
      4      |          56.9           |          12.5
      5      |          66.6           |          18.0
      6      |          82.5           |          14.3
      7      |         126.3           |          27.5
      8      |          79.3           |          16.5
      9      |         119.9           |          24.3
     10      |          87.6           |          20.2
     11      |         112.6           |          22.0
     12      |         120.8           |          19.0
     13      |          78.5           |          12.3
     14      |          74.3           |          14.0
     15      |          74.8           |          16.7
  Averages   |          88.84          |          18.17
House price vs size
Linear Regression – Multiple Variables
• β0 is the intercept (i.e. the average value for Y if all
the X's are zero); βj is the slope for the jth variable Xj.
Assumption
• The data may not form a perfect line.
• When we actually take a measurement (i.e., observe
the data), we observe:
Yi = β0 + β1 Xi + εi,
where εi is the random error associated with the ith
observation.
Assumptions about the Error
• E(εi) = 0 for i = 1, 2,…,n.
• σ(εi) = σ, where σ is unknown.
• The errors are independent.
• The εi are normally distributed (with mean 0 and
standard deviation σ).
The regression line
The least-squares regression line is the unique line such that
the sum of the squared vertical (y) distances between the
data points and the line is the smallest possible.
Criterion for choosing what line to draw:
method of least squares
• The method of least squares chooses the line (b0 and b1)
that makes the sum of squares of the residuals as
small as possible
• Minimizes

    Σi=1..n [yi − (b0 + b1 xi)]²

for the given observations (xi, yi)
How do we "learn" parameters
• For the 2-d problem
• To find the values for the coefficients which minimize the
objective function, we take the partial derivatives of the
objective function (SSE) with respect to the coefficients, set
these to 0, and solve.
Multiple Linear Regression
Y = β0 + β1 X1 + β2 X2 + … + βn Xn
• There is a closed form which requires matrix
inversion, etc.
• There are iterative techniques to find weights
– delta rule (also called LMS method) which will update
towards the objective of minimizing the SSE.
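A minimal sketch of the closed-form solution (the normal equation
β = (XᵀX)⁻¹Xᵀy); the toy data here is an illustrative assumption:

    import numpy as np

    # Toy data: two input features plus a leading column of ones for the intercept.
    X = np.array([[1, 20.0, 2], [1, 14.8, 3], [1, 27.5, 4], [1, 12.5, 1]])
    y = np.array([89.5, 79.9, 126.3, 56.9])

    # Normal equation; solving the linear system is preferred over inverting.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta                 # fitted values
    sse = np.sum((y - y_hat) ** 2)   # the SSE that this beta minimizes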
Linear Regression
To learn the parameters θ:
• Make h(x) close to y, for the available training
examples.
• Define a cost function
J(θ) = ½ Σi (hθ(x(i)) − y(i))²
• Find θ that minimizes J(θ).
LMS Algorithm
• Start a search algorithm (e.g. gradient descent)
with an initial guess of θ.
• Repeatedly update θ to make J(θ) smaller, until it
converges to a minimum:
βj := βj − α ∂J(θ)/∂βj
• J is a convex quadratic function, so it has a single global
minimum; gradient descent eventually converges to it.
• At each iteration this algorithm takes a step in the direction of
steepest descent (the negative direction of the gradient).
LMS Update Rule
• If you have only one training example (x, y):
J(θ) = ½ (hθ(x) − y)²
• For a single training example, this gives the
update rule:
βj := βj + α (y − hθ(x)) xj
m training examples
Repeat until convergence {
βj := βj + α Σi=1..m (y(i) − hθ(x(i))) xj(i)   (for every j)
}
• Batch Gradient Descent: looks at every example on
each step.
Stochastic gradient descent
• Repeatedly run through the training set.
• Whenever a training point is encountered, update the
parameters according to the gradient of the error with
respect to that training example only.
Repeat {
for i = 1 to m do
βj := βj + α (y(i) − hθ(x(i))) xj(i)   (for every j)
end for
} until convergence
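A minimal sketch of both update rules for linear regression; the learning
rate, iteration counts, and the function names are illustrative assumptions:

    import numpy as np

    def batch_gd(X, y, alpha=0.01, iters=1000):
        """Batch: each step uses the gradient summed over all m examples."""
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            beta += alpha * (X.T @ (y - X @ beta))
        return beta

    def stochastic_gd(X, y, alpha=0.01, epochs=100):
        """Stochastic: update after every single training example."""
        beta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(len(y)):
                beta += alpha * (y[i] - X[i] @ beta) * X[i]
        return beta

    # X should include a leading column of ones if an intercept is wanted.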
Delta Rule for Classification
[Plots: target z (0 or 1) against input x for two datasets]
• What would happen in this adjusted case for the perceptron and the delta rule,
and where would the decision point (i.e. the .5 crossing) be?
Delta Rule for Classification
[Plots: the same two datasets with fitted regression lines]
• Leads to misclassifications even though the data is linearly separable
• For the delta rule the objective function is to minimize the regression line SSE,
not to maximize classification accuracy
Delta Rule for Classification
[Plots: the two datasets, plus a third panel]
• What would happen if we were doing a regression fit with a sigmoid/logistic
curve rather than a line?
Delta Rule for Classification
[Plots: the datasets fitted with sigmoid curves]
• The sigmoid fits many decision cases quite well! This is basically what logistic
regression does.
Definition
• A decision tree is a classifier in the form of a
tree structure with two types of nodes:
– Decision node: Specifies a choice or test of
some attribute, with one branch for each
outcome
– Leaf node: Indicates classification of an example
Decision Tree Example 1
Whether to approve a loan

Employed?
├─ No  → Credit Score?
│        ├─ High → Approve
│        └─ Low  → Reject
└─ Yes → Income?
         ├─ High → Approve
         └─ Low  → Reject
Decision Tree Example 3
Issues
• Given some training examples, what decision tree
should be generated?
• One proposal: prefer the smallest tree that is
consistent with the data (Bias)
– the tree with the least depth?
– the tree with the fewest nodes?
• Possible method:
– search the space of decision trees for the smallest decision
tree that fits the data
Example Data
Training Examples:
         Action  Author   Thread  Length  Where
   e1    skips   known    new     long    home
   e2    reads   unknown  new     short   work
   e3    skips   unknown  old     long    work
   e4    skips   known    old     long    home
   e5    reads   known    new     short   home
   e6    skips   known    old     long    work

New Examples:
   e7    ???     known    new     short   work
   e8    ???     unknown  new     short   work
Possible splits
Root: skips 9, reads 9

Split on length:
– long:  skips 7, reads 0
– short: skips 2, reads 9

Split on thread:
– new: skips 3, reads 7
– old: skips 6, reads 2
Two Example DTs
Decision Tree for PlayTennis
• Attributes and their values:
– Outlook: Sunny, Overcast, Rain
– Humidity: High, Normal
– Wind: Strong, Weak
– Temperature: Hot, Mild, Cool
• Target concept - Play Tennis: Yes, No

Decision Tree for PlayTennis
Outlook
├─ Sunny    → Humidity
│             ├─ High   → No
│             └─ Normal → Yes
├─ Overcast → Yes
└─ Rain     → Wind
              ├─ Strong → No
              └─ Weak   → Yes
Decision Tree for PlayTennis
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
Decision Tree for PlayTennis
Classifying a new example with the tree above:
Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak
→ Outlook=Sunny, then Humidity=High → No
Decision Tree
Decision trees represent disjunctions of conjunctions.
The tree above corresponds to:
(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
Searching for a good tree
• How should you go about building a decision tree?
• The space of decision trees is too big for systematic
search.
• The recursive approach: either stop and
– return a value for the target feature, or
– a distribution over target feature values,
• or choose a test (e.g. an input feature) to split on.
– For each value of the test, build a subtree for those
examples with this value for the test.
Top-Down Induction of Decision Trees ID3
Two questions: 1. Which attribute to proceed with? 2. When to stop?
1. A ← the “best” decision attribute for the next node
2. Assign A as decision attribute for the node
3. For each value of A create a new descendant
4. Sort training examples to leaf nodes according to the
attribute value of the branch
5. If all training examples are perfectly classified (same
value of target attribute) stop, else iterate over new
leaf nodes.
Choices
• When to stop
– no more input features
– all examples are classified the same
– too few examples to make an informative split
• Which test to split on
– the split that gives the smallest error.
– With multi-valued features
• split on all values or
• split values into two halves.
Which Attribute is ”best”?
A1: [29+,35-] → True: [21+, 5-], False: [8+, 30-]
A2: [29+,35-] → True: [18+, 33-], False: [11+, 2-]
Principled Criterion
• Selection of an attribute to test at each node -
choosing the most useful attribute for classifying
examples.
• information gain
– measures how well a given attribute separates the training
examples according to their target classification
– This measure is used to select among the candidate
attributes at each step while growing the tree
– Gain is a measure of how much we can reduce
uncertainty (its value lies between 0 and 1)
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: optimal length code assigns (- log2p) bits
to message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S
• Entropy of S: average optimal number of bits to encode
information about certainty/uncertainty about S
Entropy(S) = p+(−log2 p+) + p−(−log2 p−) = −p+ log2 p+ − p− log2 p−
Entropy
• The entropy is 0 if the outcome is “certain”.
• The entropy is maximum if we have no knowledge of the
system (i.e., any outcome is equally possible).
• S is a sample of training examples
– p+ is the proportion of positive examples
– p− is the proportion of negative examples
• Entropy measures the impurity of S:
Entropy(S) = −p+ log2 p+ − p− log2 p−
Information Gain
Gain(S,A): expected reduction in entropy due to partitioning S
on attribute A

Gain(S,A) = Entropy(S) − Σv∈Values(A) (|Sv|/|S|) Entropy(Sv)

Entropy([29+,35-]) = −29/64 log2 29/64 − 35/64 log2 35/64 = 0.99

A1: [29+,35-] → True: [21+, 5-], False: [8+, 30-]
A2: [29+,35-] → True: [18+, 33-], False: [11+, 2-]
Information Gain
Entropy([21+,5-]) = 0.71    Entropy([18+,33-]) = 0.94
Entropy([8+,30-]) = 0.74    Entropy([11+,2-]) = 0.62

Gain(S,A1) = Entropy(S)
− 26/64 · Entropy([21+,5-])
− 38/64 · Entropy([8+,30-])
= 0.27

Gain(S,A2) = Entropy(S)
− 51/64 · Entropy([18+,33-])
− 13/64 · Entropy([11+,2-])
= 0.12

A1: [29+,35-] → True: [21+, 5-], False: [8+, 30-]
A2: [29+,35-] → True: [18+, 33-], False: [11+, 2-]
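A minimal sketch that reproduces these numbers; the function names are
illustrative:

    import numpy as np

    def entropy(pos, neg):
        """Entropy(S) = -p+ log2 p+ - p- log2 p- (0 log 0 taken as 0)."""
        probs = np.array([pos, neg]) / (pos + neg)
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))

    def gain(parent, children):
        """Expected reduction in entropy from splitting parent into children."""
        n = sum(p + q for p, q in children)
        remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
        return entropy(*parent) - remainder

    print(gain((29, 35), [(21, 5), (8, 30)]))   # A1: ~0.27
    print(gain((29, 35), [(18, 33), (11, 2)]))  # A2: ~0.12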
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Selecting the Next Attribute
S=[9+,5-], E=0.940

Humidity:
– High:   [3+, 4-], E=0.985
– Normal: [6+, 1-], E=0.592
Gain(S,Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind:
– Weak:   [6+, 2-], E=0.811
– Strong: [3+, 3-], E=1.0
Gain(S,Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048

Humidity provides greater info. gain than Wind, w.r.t. the target classification.
Selecting the Next Attribute
S=[9+,5-], E=0.940

Outlook:
– Sunny:    [2+, 3-], E=0.971
– Overcast: [4+, 0-], E=0.0
– Rain:     [3+, 2-], E=0.971
Gain(S,Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0Log20 =0
ID3 Algorithm
[D1,…,D14], [9+,5-]: split on Outlook
– Sunny:    Ssunny = [D1,D2,D8,D9,D11], [2+,3-] → ? (test needed for this node)
– Overcast: [D3,D7,D12,D13], [4+,0-] → Yes
– Rain:     [D4,D5,D6,D10,D14], [3+,2-] → ?

Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temp.)    = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
ID3 Algorithm
Outlook
├─ Sunny    → Humidity
│             ├─ High   → No   [D1,D2,D8]
│             └─ Normal → Yes  [D9,D11]
├─ Overcast → Yes  [D3,D7,D12,D13]
└─ Rain     → Wind
              ├─ Strong → No   [D6,D14]
              └─ Weak   → Yes  [D4,D5,D10]
Splitting Rule: GINI Index
• GINI Index
– Measure of node impurity

GINI(Node) = 1 − Σc∈classes [p(c)]²

GINIsplit(A) = Σv∈Values(A) (|Sv|/|S|) GINI(Nv)
Splitting Based on Continuous Attributes
(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K,25K),
[25K,50K), [50K,80K), > 80K
Continuous Attribute – Binary Split
• For a continuous attribute
– Partition the continuous values of attribute A into a
discrete set of intervals
– Create a new boolean attribute Ac by looking for a
threshold c:
Ac = true if A ≥ c, false otherwise
How to choose c?
• consider all possible splits and find the best cut
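A minimal sketch of this threshold search, scored with the GINI index from
the previous slide; the helper names are illustrative:

    import numpy as np

    def gini(labels):
        """GINI(node) = 1 - sum_c p(c)^2."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_threshold(values, labels):
        """Try a cut between each pair of adjacent sorted values; keep the best."""
        order = np.argsort(values)
        v, y = values[order], labels[order]
        best_c, best_score = None, np.inf
        for i in range(1, len(v)):
            if v[i] == v[i - 1]:
                continue
            c = (v[i] + v[i - 1]) / 2
            left, right = y[:i], y[i:]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_c, best_score = c, score
        return best_c, best_score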
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees, which
can represent all possible discrete functions.
• Goal: to find the best decision tree
• Finding a minimal decision tree consistent with a set of data
is NP-hard.
• Perform a greedy heuristic search: hill climbing without
backtracking
• Statistics-based decisions using all data
Overfitting
• Learning a tree that classifies the training data
perfectly may not lead to the tree with the best
generalization performance.
– There may be noise in the training data
– May be based on insufficient data
• A hypothesis h is said to overfit the training
data if there is another hypothesis, h’, such that
h has smaller error than h’ on the training data
but h has larger error on the test data than h’.
Overfitting
[Plot: accuracy on training data keeps rising with tree
complexity while accuracy on testing data falls off]
Underfitting and Overfitting (Example)
• 500 circular and 500 triangular data points.
• Circular points:
0.5 ≤ sqrt(x1² + x2²) ≤ 1
• Triangular points:
sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
Underfitting and Overfitting
[Plot: training and test error vs. model complexity]
Underfitting: when the model is too simple, both training and test errors are large
Overfitting due to Noise
The decision boundary is distorted by noise points.

Overfitting due to Insufficient Examples
A lack of data points makes it difficult to predict correctly the class labels
of that region.
Notes on Overfitting
• Overfitting results in decision trees that are more
complex than necessary
• Training error no longer provides a good estimate of
how well the tree will perform on previously unseen
records
Avoid Overfitting
• How can we avoid overfitting a decision tree?
– Prepruning: Stop growing when the data split is not statistically
significant
– Postpruning: Grow the full tree, then remove nodes
• Methods for evaluating subtrees to prune:
– Minimum description length (MDL):
Minimize size(tree) + size(misclassifications(tree))
– Cross-validation
Pre-Pruning (Early Stopping)
• Evaluate splits before installing them:
– Don’t install splits that don’t look worthwhile
– when no worthwhile splits to install, done
Pre-Pruning (Early Stopping)
• Typical stopping conditions for a node:
– Stop if all instances belong to the same class
– Stop if all the attribute values are the same
• More restrictive conditions:
– Stop if number of instances is less than some user-specified
threshold
– Stop if the class distribution of instances is independent of the
available features (e.g., using the χ² test)
– Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
Reduced-error Pruning
• A post-pruning, cross-validation approach
– Partition training data into a “grow” set and a “validation” set.
– Build a complete tree for the “grow” data
– Until accuracy on the validation set decreases, do:
For each non-leaf node in the tree
Temporarily prune the tree below; replace it by a majority vote
Test the accuracy of the hypothesis on the validation set
Permanently prune the node with the greatest increase
in accuracy on the validation set.
• Problem: Uses less data to construct the tree
• Sometimes done at the rules level
• General Strategy: Overfit and Simplify
Model Selection & Generalization
• Learning is an ill-posed problem; data is not sufficient
to find a unique solution
• The need for inductive bias: assumptions about H
• Generalization: How well a model performs on new
data
• Overfitting: H more complex than C or f
• Underfitting: H less complex than C or f
Triple Trade-Off
• There is a trade-off between three factors:
– Complexity of H, c(H)
– Training set size, N
– Generalization error, E, on new data
• As N increases, E decreases
• As c(H) increases, first E decreases and then E increases
• As c(H) increases, the training error decreases for some time
and then stays constant (frequently at 0)
Notes on Overfitting
• overfitting happens when a model is capturing
idiosyncrasies of the data rather than generalities.
– Often caused by too many parameters relative to the
amount of training data.
– E.g. an order-N polynomial can intersect any N+1 data
points
Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian
Regularization
• In a linear regression model, overfitting is
characterized by large weights.

Penalize large weights in Linear Regression
• Introduce a penalty term in the loss function.
Regularized Regression:
1. L2-Regularization (Ridge Regression)
2. L1-Regularization
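A minimal sketch of L2-regularized (ridge) regression in closed form,
minimizing Σ(y − Xβ)² + λ‖β‖²; the data and λ are illustrative assumptions:

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Closed-form ridge solution: beta = (X^T X + lam I)^-1 X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # A larger lam shrinks the weights toward zero, penalizing large weights.
    # (L1-regularization has no closed form; it is typically solved iteratively.)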
Feature Reduction in ML
- The information about the target class is inherent
in the variables.
- Naïve view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
Curse of Dimensionality
• number of training examples is fixed
=> the classifier’s performance usually will
degrade for a large number of features!
Feature Selection
The problem of selecting some subset of features, while
ignoring the rest.
Feature Extraction
• Project the original xi, i = 1,...,d dimensions to new
dimensions, zj, j = 1,...,k
• Criteria for selection/extraction:
either improve or maintain the classification
accuracy, while simplifying classifier complexity.
Feature Selection - Definition
• Given a set of features, the Feature Selection problem is
to find a subset that maximizes the learner's
ability to classify patterns.
• Formally, it should maximize some scoring function.
Subset selection
• d initial features
• There are 2^d possible subsets
• Criteria to decide which subset is the best:
– the classifier based on these m features has the
lowest probability of error of all such classifiers
• Can't go over all possibilities
• Need some heuristics
Feature Selection Steps
Feature selection is an optimization problem.
o Step 1: Search the space of possible feature subsets.
o Step 2: Pick the subset that is optimal or near-optimal
with respect to some objective function.
Feature Selection Steps (cont'd)
Search strategies
– Optimum
– Heuristic
– Randomized
Evaluation strategies
– Filter methods
– Wrapper methods
Evaluating feature subsets
• Supervised (wrapper method)
– Train using the selected subset
– Estimate error on a validation dataset
• Unsupervised (filter method)
– Look at the input only
– Select the subset that has the most information
Evaluation Strategies
Filter Methods Wrapper Methods
Subset selection
• Select uncorrelated features
• Forward search
– Start from empty set of features
– Try each of remaining features
– Estimate classification/regression error for adding specific feature
– Select feature that gives maximum improvement in validation
error
– Stop when no significant improvement
• Backward search
– Start with original set of size d
– Drop features with smallest impact on error
Feature selection
Univariate (looks at each feature independently of others)
– Pearson correlation coefficient
– F-score
– Chi-square
– Signal-to-noise ratio
– Mutual information
– Etc.
Univariate methods measure some type of correlation between two
random variables: the label (yi) and a fixed feature (xij for fixed j).
• Rank features by importance
• Ranking cut-off is determined by the user
Pearson correlation coefficient
• Measures the correlation between two variables:
r = cov(X,Y) / (σX σY)
• The correlation r is between −1 and +1.
– +1 means perfect positive correlation
– −1 means perfect correlation in the other direction
[Figure: scatter plots with various correlation values, from Wikipedia]
Signal-to-noise ratio
• Difference in means divided by difference in
standard deviations between the two classes:
S2N(X,Y) = (μX − μY) / (σX − σY)
• Large values indicate a strong correlation
Multivariate feature selection
• Multivariate (considers all features simultaneously)
• Consider the vector w for any linear classifier.
• Classification of a point x is given by wTx+w0.
• Small entries of w will have little effect on the dot
product and therefore those features are less relevant.
• For example if w = (10, .01, -9) then features 0 and 2 are
contributing more to the dot product than feature 1.
– A ranking of features given by this w is 0, 2, 1.
Multivariate feature selection
• The w can be obtained from any linear classifier.
• A variant of this approach is called recursive feature
elimination (RFE):
1. Compute w on all features
2. Remove the feature with smallest |wi|
3. Recompute w on the reduced data
4. If the stopping criterion is not met, go to step 2
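A minimal sketch of recursive feature elimination following these steps; the
linear-classifier training routine fit_linear is an assumed placeholder:

    import numpy as np

    def rfe(X, y, fit_linear, n_keep):
        """Repeatedly drop the feature whose weight has the smallest magnitude.

        fit_linear(X, y) is any routine returning a weight vector w, one entry
        per column of X (e.g. a linear SVM or logistic regression).
        """
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_keep:
            w = fit_linear(X[:, remaining], y)   # steps 1/3: compute w
            drop = int(np.argmin(np.abs(w)))     # step 2: smallest |w_i|
            remaining.pop(drop)
        return remaining                          # indices of kept features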
Feature extraction - definition
• Given a set of features, the Feature Extraction
(“Construction”) problem is to map them to a new feature
set that maximizes the learner's ability to classify patterns.
Feature Extraction
• Find a projection matrix w from N-dimensional to
M-dimensional vectors that keeps the error low
PCA
• Assume that the N features are linear
combinations of a set of basis vectors.
• What we expect from such a basis:
– Uncorrelated, or otherwise it can be reduced further
– Having large variance (e.g. large variation), or
otherwise bearing no information
Geometric picture of principal components (PCs)
Algebraic definition of PCs
Given a sample of p observations on a vector of N variables

x1, x2, …, xp ∈ ℝ^N

define the first principal component of the sample
by the linear transformation

z1 = a1ᵀ xj = Σi=1..N ai1 xij,   j = 1, 2, …, p,

where a1 = (a11, a21, …, aN1) and xj = (x1j, x2j, …, xNj),
and a1 is chosen such that var[z1] is maximum.
PCA
• Choose directions such that a total variance of data
will be maximum
– Maximize Total Variance
• Choose directions that are orthogonal
– Minimize correlation
• Choose orthogonal directions which maximize total
variance
PCA
• N-dimensional feature space
• N × N symmetric covariance matrix estimated from the samples
• Select the largest eigenvalues of the covariance matrix
and the associated eigenvectors
• The first eigenvector will be the direction with largest
variance
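A minimal sketch of these steps with numpy; the number of kept components
k is an illustrative assumption:

    import numpy as np

    def pca(X, k):
        """Project data onto the k eigenvectors with largest eigenvalues."""
        Xc = X - X.mean(axis=0)              # center the data
        cov = np.cov(Xc, rowvar=False)       # N x N symmetric covariance matrix
        vals, vecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
        order = np.argsort(vals)[::-1]       # sort eigenvalues, largest first
        W = vecs[:, order[:k]]               # top-k eigenvectors as columns
        return Xc @ W                        # k-dimensional projection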
PCA for image compression
[Figure: reconstructions with p = 1, 2, 4, 8, 16, 32, 64, 100
principal components, next to the original image]
Is PCA a good criterion for classification?
• Data variation determines the projection direction
• What's missing?
– Class information

What is a good projection?
• Similarly, what is a good criterion?
– Separating different classes
[Figure: one projection where the two classes overlap,
another where the two classes are separated]
What class information may be useful?
• Between-class distance
– Distance between the centroids of different classes
• Within-class distance
– Accumulated distance of an instance to the centroid of its class
• Linear discriminant analysis (LDA) finds the most
discriminant projection by maximizing between-class
distance and minimizing within-class distance
Linear Discriminant Analysis
• Find a low-dimensional space such that when x is
projected, the classes are well-separated.
Means and Scatter after projection
Good Projection
• Means are as far away as possible
• Scatter is as small as possible
• Fisher Linear Discriminant:

J(w) = (m1 − m2)² / (s1² + s2²)
Multiple Classes
• For K classes, compute K−1 discriminants, projecting the
N-dimensional features into a (K−1)-dimensional space.
Probability Basics
• Probability is the study of randomness and uncertainty.
• A random experiment is a process whose outcome is uncertain.
Examples:
– Tossing a coin once or several times
– Tossing a die
– Tossing a coin until one gets Heads
– ...
Events and Sample Spaces
• Sample Space: the set of all possible outcomes.
• Simple Events: the individual outcomes are called simple events.
• Event: an event is any collection of one or more simple events.
Sample Space
• Sample space: the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin three
times; the sample space is the set of all eight sequences,
{hhh, hht, hth, htt, thh, tht, tth, ttt}
– If the experiment is the number of customers that arrive
at a service desk during a fixed time period, the sample
space should be the set of nonnegative integers
Events
• Events are subsets of the sample space
o A = {the outcome that the die is even} = {2,4,6}
o B = {exactly two tosses come out tails} = {htt, tht, tth}
o C = {at least two heads} = {hhh, hht, hth, thh}
Probability
• A probability is a number assigned to each
event in the sample space.
• Axioms of Probability:
– For any event A, 0 ≤ P(A) ≤ 1.
– P(Ω) = 1
– If A1, A2, …, An is a partition of A, then
P(A) = P(A1) + P(A2) + … + P(An)
Properties of Probability
• For any event A, P(Ac) = 1 − P(A).
• If A ⊆ B, then P(A) ≤ P(B).
• For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• For three events A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
− P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
+ P(A ∩ B ∩ C)
Intuitive Development (agrees with axioms)
• Intuitively, the probability of an event a could
be defined as:
P(a) = lim n→∞ N(a)/n
where N(a) is the number of times event a happens in n trials.
Random Variable
• A random variable is a function defined on the
sample space
– it maps the outcome w of a random event to a real
scalar value X(w)
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the values of two dice
• X is an RV with arity k if it can take on exactly one
value out of k values,
– e.g., the possible values that X can take on are
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Probability of Discrete RV
• Probability mass function (pmf): P(X = xi)
• Simple facts about the pmf:
– Σi P(X = xi) = 1
– P(X = xi ∧ X = xj) = 0 if i ≠ j
– P(X = xi ∨ X = xj) = P(X = xi) + P(X = xj) if i ≠ j
– P(X = x1 ∨ X = x2 ∨ … ∨ X = xk) = 1
Common Distributions
• Uniform
– X takes values 1, 2, …, N
– P(X = i) = 1/N
– E.g. picking balls of different colors from a box
• Binomial
– X takes values 0, 1, …, n
– P(X = i) = C(n, i) p^i (1 − p)^(n−i)
– E.g. coin flips
Joint Distribution
• Given two discrete RVs X and Y, their joint
distribution is the distribution of X and Y together
– e.g. you and your friend each toss a coin 10 times;
P(You get 5 heads AND your friend gets 7 heads)
• Σx Σy P(X = x ∧ Y = y) = 1
– e.g. Σi=0..10 Σj=0..10 P(You get i heads AND your friend gets j heads) = 1
Conditional Probability
• P(X = x | Y = y) is the probability of X = x, given the
occurrence of Y = y
– E.g. you get 0 heads, given that your friend gets 3 heads

P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y)
Law of Total Probability
• Given two discrete RVs X and Y, which take values in
{x1, …, xm} and {y1, …, yn}, we have

P(X = xi) = Σj P(X = xi ∧ Y = yj)
          = Σj P(X = xi | Y = yj) P(Y = yj)

Marginalization
• The same identity read the other way: the marginal
probability P(X = xi) is obtained by summing the joint
probabilities P(X = xi ∧ Y = yj), or equivalently the
products of conditional and marginal probabilities, over all yj.
Bayes Rule
• X and Y are discrete RVs:

P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y)

P(X = xi | Y = yj) = P(Y = yj | X = xi) P(X = xi)
                     / Σk P(Y = yj | X = xk) P(X = xk)
Independent RVs
• X and Y are independent means that X = x does not
affect the probability of Y = y
• Definition: X and Y are independent iff
P(X = x ∧ Y = y) = P(X = x) P(Y = y)

More on Independence
P(X = x | Y = y) = P(X = x),   P(Y = y | X = x) = P(Y = y)
• E.g. no matter how many heads you get, your
friend's tosses will not be affected, and vice versa
Conditionally Independent RVs
• Intuition: X and Y are conditionally independent given Z
means that once Z is known, the value of X does not add
any additional information about Y
• Definition: X and Y are conditionally independent given Z iff
P(X = x ∧ Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)

More on Conditional Independence
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
P(Y = y | X = x, Z = z) = P(Y = y | Z = z)
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf:
– f(x) ≥ 0, for all x
– ∫ f(x) dx = 1
– note that f(x) ≤ 1 is not required; a density can exceed 1
• Actual probability is obtained by taking
the integral of the pdf
– E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫0..1 f(x) dx
Cumulative Distribution Function
• FX(v) = P(X ≤ v)
• Discrete RVs:
– FX(v) = Σ(vi ≤ v) P(X = vi)
• Continuous RVs:
– FX(v) = ∫−∞..v f(x) dx
– (d/dx) FX(x) = f(x)
Common Distributions
• Normal
f(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  x ∈ ℝ
– E.g. the height of the entire population
[Plot: the bell-shaped density f(x) for x from −5 to 5]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal:

f(x1, …, xd) = (1 / ((2π)^(d/2) |Σ|^(1/2)))
               exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

where μ is the mean vector and Σ the covariance matrix.
Mean and Variance
• Mean (Expectation): μ = E[X]
– Discrete RVs: E[X] = Σvi vi P(X = vi)
– Continuous RVs: E[X] = ∫ x f(x) dx
• Variance: V[X] = E[(X − μ)²]
– Discrete RVs: V[X] = Σvi (vi − μ)² P(X = vi)
– Continuous RVs: V[X] = ∫ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples x1, …, xN from a distribution,
we can estimate the mean of the distribution by:
μ̂ = (1/N) Σi=1..N xi
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• Based on prior estimate of its probability
• and New relevant evidence
Bayes Theorem
Goal: To determine the most probable hypothesis, given the data D plus any
initial knowledge about the prior probabilities of the various hypotheses in H.

Bayes Rule:  P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior density)
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.

P(cancer) = .008, P(¬cancer) = .992
P(+ | cancer) = .98, P(− | cancer) = .02
P(+ | ¬cancer) = .03, P(− | ¬cancer) = .97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
Maximum A Posteriori (MAP) Hypothesis
P(h | D) = P(D | h) P(h) / P(D)
The goal of Bayesian learning: the most probable hypothesis
given the training data (Maximum A Posteriori hypothesis):

hMAP = argmax(h∈H) P(h | D)
     = argmax(h∈H) P(D | h) P(h) / P(D)
     = argmax(h∈H) P(D | h) P(h)
Maximum Likelihood (ML) Hypothesis
hMAP = argmax(h∈H) P(h | D)
     = argmax(h∈H) P(D | h) P(h) / P(D)
     = argmax(h∈H) P(D | h) P(h)
• If every hypothesis in H is equally probable a priori,
we only need to consider the likelihood of the data D
given h, P(D|h). Then hMAP becomes the Maximum
Likelihood hypothesis,
hML = argmax(h∈H) P(D | h)
MAP Learner
For each hypothesis h in H, calculate the posterior probability
P(h | D) = P(D | h) P(h) / P(D)
Output the hypothesis hMAP with the highest posterior probability
hMAP = argmax(h∈H) P(h | D)
Comments:
• Computationally intensive
• Provides a standard for judging the performance of
learning algorithms
• Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
Maximum likelihood and least-squared error
• Learning a real-valued function:
• Consider any real-valued target function f.
• Training examples (xi, di) are assumed to have normally
distributed noise ei with zero mean and variance σ², added
to the true target value f(xi); that is, di satisfies
di = f(xi) + ei
• Assume that ei is drawn independently for each xi.
Compute the ML Hypothesis

hML = argmax(h∈H) p(D | h)
    = argmax(h∈H) Πi=1..m (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))
    = argmax(h∈H) Σi=1..m [ −½ ln(2πσ²) − (di − h(xi))² / (2σ²) ]
    = argmin(h∈H) Σi=1..m (di − h(xi))²

So under Gaussian noise, the maximum likelihood hypothesis is the one
that minimizes the sum of squared errors.
Bayes Optimal Classifier
Question: Given a new instance x, what is its most probable classification?
• hMAP(x) is not necessarily the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new data x, we have h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?
Bayes optimal classification:
argmax(vj∈V) Σ(hi∈H) P(vj | hi) P(hi | D)
where V is the set of all the values a classification can take and vj is one
possible such classification.
Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
Σ(hi∈H) P(+ | hi) P(hi | D) = .4
Σ(hi∈H) P(− | hi) P(hi | D) = .6
so the Bayes optimal classification is −.
Why “Optimal”?
• Optimal in the sense that no other classifier
using the same H and prior knowledge can
outperform it on average
Gibbs Algorithm
• The Bayes optimal classifier is quite computationally
expensive if H contains a large number of hypotheses.
• An alternative, less optimal classifier, the Gibbs algorithm,
is defined as follows:
1. Choose a hypothesis randomly according to P(h|D),
the posterior probability distribution over H.
2. Use it to classify the new instance.
Naïve Bayes
• Bayes classification
P(Y | X) ∝ P(X | Y) P(Y) = P(X1, …, Xn | Y) P(Y)
Difficulty: learning the joint probability P(X1, …, Xn | Y)
• Naïve Bayes classification
Assume all input features are conditionally independent given Y:
P(X1, X2, …, Xn | Y) = P(X1 | X2, …, Xn, Y) P(X2, …, Xn | Y)
                     = P(X1 | Y) P(X2, …, Xn | Y)
                     = P(X1 | Y) P(X2 | Y) … P(Xn | Y)
Naïve Bayes
Bayes rule:
P(Y = yk | X1, …, Xn) ∝ P(Y = yk) P(X1, …, Xn | Y = yk)
Assuming conditional independence among the Xi's:
P(Y = yk | X1, …, Xn) ∝ P(Y = yk) Πi P(Xi | Y = yk)
So, the classification rule for Xnew = <X1, …, Xn> is:
Ynew = argmax(yk) P(Y = yk) Πi P(Xi_new | Y = yk)
Naïve Bayes Algorithm – discrete Xi
• Train Naïve Bayes (examples):
for each* value yk
estimate P(Y = yk)
for each* value xij of each attribute Xi
estimate P(Xi = xij | Y = yk)
• Classify (Xnew):
Ynew = argmax(yk) P(Y = yk) Πi P(Xi_new | Y = yk)
* probabilities must sum to 1, so we need to estimate only n−1 parameters...
Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates (MLEs):
P̂(Y = yk) = #D{Y = yk} / |D|
P̂(Xi = xij | Y = yk) = #D{Xi = xij ∧ Y = yk} / #D{Y = yk}
where #D{...} is the number of items in the dataset D for
which the condition holds.
Example
• Example: Play Tennis
Example
Learning Phase

Outlook     P(·|Yes)  P(·|No)      Temperature  P(·|Yes)  P(·|No)
Sunny         2/9       3/5        Hot            2/9       2/5
Overcast      4/9       0/5        Mild           4/9       2/5
Rain          3/9       2/5        Cool           3/9       1/5

Humidity    P(·|Yes)  P(·|No)      Wind         P(·|Yes)  P(·|No)
High          3/9       4/5        Strong         3/9       3/5
Normal        6/9       1/5        Weak           6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
Test Phase
– Given a new instance, predict its label:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase:
P(Outlook=Sunny|Play=No) = 3/5
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=No) = 1/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9
P(Play=No) = 5/14
P(Play=Yes) = 9/14
– Decision making with the MAP rule:
P(Yes|x') ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x') ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Given that P(Yes|x') < P(No|x'), we label x' as “No”.
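A small sketch reproducing this decision; the values are copied from the
tables above:

    from math import prod

    # Class-conditional probabilities for x' = (Sunny, Cool, High, Strong).
    p_yes = prod([2/9, 3/9, 3/9, 3/9]) * 9/14   # ~0.0053
    p_no = prod([3/5, 1/5, 4/5, 3/5]) * 5/14    # ~0.0206

    print("Yes" if p_yes > p_no else "No")      # -> "No"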
Estimating Parameters: Y, Xi discrete-valued
If unlucky, our MLE estimate for P(Xi | Y) may be zero.
MAP estimates smooth the counts with “imaginary” examples;
the only difference from the MLE is these added pseudo-counts.
Naïve Bayes: Assumptions of Conditional Independence
Often the Xi are not really conditionally independent.
• We can use Naïve Bayes in many cases anyway
– it often gives the right classification, even when not the right
probability
Gaussian Naïve Bayes (continuous X)
• Algorithm: Continuous-valued Features
– The conditional probability is often modeled with the normal
distribution.
• Sometimes we assume the variance is
– independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ)
Gaussian Naïve Bayes Algorithm – continuous Xi
(but still discrete Y)
• Train Naïve Bayes (examples):
for each value yk
estimate P(Y = yk)
for each attribute Xi, estimate the
class-conditional mean μik and variance σik²
• Classify (Xnew):
Ynew = argmax(yk) P(Y = yk) Πi Normal(Xi_new; μik, σik)
Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
μ̂ik = ( Σj Xij δ(Yj = yk) ) / ( Σj δ(Yj = yk) )
where j indexes training examples, i the feature, k the class,
and δ(z) = 1 if z is true, else 0.
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:
μ̂ = (1/N) Σn xn,   σ̂² = (1/N) Σn (xn − μ̂)²
μ_Yes = 21.64, σ_Yes = 2.35
μ_No = 23.88, σ_No = 7.09
– Learning Phase: output two Gaussian models for P(temp|C):
P̂(x | Yes) = (1/(2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²))
P̂(x | No) = (1/(7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²))
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• Rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Bayesian Networks

Why Bayes Network
• The Bayes optimal classifier is too costly to apply.
• Naïve Bayes makes overly restrictive assumptions.
– But all variables are rarely completely independent.
• A Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.

[Example network over the variables: Late wakeup, Accident,
Rainy day, Traffic Jam, Meeting postponed, Late for Work,
Late for meeting]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = { X1,…….Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief Networks
[The same example network as above]
• A conditional probability table associated with each node
specifies the conditional distribution for the variable
given its immediate parents in the graph.
• Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents.
Inference in Bayesian Networks
• Computes posterior probabilities given evidence about
some nodes
• Exploits probabilistic independence for efficient
computation.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian network is known to
be NP-hard.
• In theory, approximate techniques (such as Monte Carlo
methods) can also be NP-hard, though in practice
many such methods have been shown to be useful.
• Efficient algorithms leverage the structure of the graph.
Applications of Bayesian Networks
• Diagnosis: P(cause|symptom) = ?
• Prediction: P(symptom|cause) = ?
• Classification: P(class|data)
• Decision-making (given a cost function)
Bayesian Networks
• Structure of the graph ⇒ conditional independence relations

In general,
p(X1, X2, …, XN) = Π p(Xi | parents(Xi))

The full joint distribution ⇒ the graph-structured approximation
• Requires that the graph is acyclic (no directed cycles)
• 2 components of a Bayesian network
– The graph structure (conditional independence
assumptions)
– The numerical probabilities (for each variable given its
parents)
Examples
• A  B  C (no edges). Marginal independence:
p(A,B,C) = p(A) p(B) p(C)

• A → B, A → C (A: D; B: S1; C: S2).
Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A

• A → C ← B (A: Traffic; B: Late wakeup; C: Late).
Independent causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
“Explaining away”

• A → B → C. Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
Naïve Bayes Model
C → Y1, Y2, Y3, …, Yn
(class C is the single parent of every feature Yi)
Hidden Markov Model (HMM)
Observed: Y1, Y2, Y3, …, Yn
Hidden:   S1 → S2 → S3 → … → Sn, with St → Yt

Assumptions:
1. The hidden state sequence is Markov
2. The observation Yt is conditionally independent of all other
variables given St

Widely used in sequence learning, e.g., speech recognition, POS
tagging
Learning Bayesian Belief Networks
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
– Estimate the conditional probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training data.
– Similar to learning the weights for the hidden
units of a neural net: gradient ascent procedure
3. The network structure is not known in advance.
– Use a heuristic search or constraint-based technique to
search through potential structures.
Logistic Regression
Logistic Regression for classification
• Linear Regression: hθ(x) = θ0 + θ1x1 + … + θnxn
• Logistic Regression for classification:
hθ(x) = g(θ0 + θ1x1 + … + θnxn), where
g(z) = 1 / (1 + e^(−z))
g(·) is called the logistic function or the sigmoid function.
Sigmoid function properties
• Bounded between 0 and 1
• g(z) → 1 as z → ∞
• g(z) → 0 as z → −∞
Logistic Regression
• In logistic regression, we learn the conditional distribution
P(y|x)
• Let py(x; θ) be our estimate of P(y|x), where θ is a vector of
adjustable parameters.
• Assume there are two classes, y = 0 and y = 1, with
p1(x; θ) = hθ(x) and p0(x; θ) = 1 − hθ(x)
• This can be written more compactly as
P(y | x; θ) = hθ(x)^y (1 − hθ(x))^(1−y)
• We can use the gradient method to fit θ.
Maximize likelihood
• How do we maximize the likelihood? Gradient ascent.
– Updates: θ := θ + α ∇θ ℓ(θ)
• Assuming one training example (x, y), taking derivatives
yields the stochastic gradient ascent rule:
θj := θj + α (y − hθ(x)) xj
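A minimal sketch of this update; the learning rate, epoch count, and
function names are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_sgd(X, y, alpha=0.1, epochs=100):
        """Stochastic gradient ascent on the log-likelihood."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(len(y)):
                h = sigmoid(X[i] @ theta)
                theta += alpha * (y[i] - h) * X[i]   # theta_j += a (y - h(x)) x_j
        return theta

    # Predict 1 when sigmoid(x @ theta) >= 0.5, i.e. when x @ theta >= 0.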
Introduction to Support Vector Machine
Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.
Logistic Regression and Confidence
• Logistic Regression: hθ(x) = g(θᵀx)
• Predict 1 on an input x iff hθ(x) ≥ 0.5, or
equivalently, θᵀx ≥ 0.
• The larger the value of θᵀx, the larger is the probability, and
the higher the confidence.
• Similarly, a confident prediction of 0 if θᵀx is very negative.
• We are more confident of predictions for points (instances)
located far from the decision surface.
Preventing overfitting with many features
• Suppose a big set of features.
• What is the best separating line to use?
• Bayesian answer:
– Use all of them
– Weight each line by its posterior probability
• Can we approximate the correct answer efficiently?
Support Vectors
• The line that maximizes the minimum margin.
• This maximum-margin separator is determined by a
subset of the datapoints.
– called “support vectors”.
– we use the support vectors to decide which side of the
separator a test case is on.
[Figure: the support vectors are indicated by the circles
around them]
Functional Margin
• Functional margin of a point (x(i), y(i)) wrt (w,b):
γ̂(i) = y(i) (wᵀx(i) + b)
– Measured by the distance of a point from the decision boundary
– Larger functional margin ⇒ more confidence in a correct prediction
– Problem: w and b can be scaled to make this value arbitrarily large
• The functional margin of the training set wrt (w,b) is the
minimum over the individual margins: γ̂ = mini γ̂(i)
Geometric Margin
• For a decision surface wᵀx + b = 0, the vector
orthogonal to it is w.
• The unit-length orthogonal vector is w / ‖w‖.
[Figure: points P = (a1,a2) and Q = (b1,b2) on either side of
the hyperplane (w,b), with the orthogonal direction w]
Geometric Margin

P=(a1,a2)

Q=(b1,b2) →
𝑤

(w,b)

Geometric margin :
Geometric margin of (w,b) wrt S=
-- smallest of the geometric margins of individual points. 255
Maximize margin width
• Assume linearly separable training examples.
• The classifier with the maximum margin width is robust to outliers and thus has strong generalization ability.
(Figure: positive (+1) and negative (−1) examples in the (x1, x2) plane separated by a line, with the margin shown.)
Maximize Margin Width
• Maximize the margin 2/‖w‖ subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i
• Scale (w, b) so that the functional margin of the closest points is 1
• Maximizing 2/‖w‖ is the same as minimizing ‖w‖²/2
• Minimize ½‖w‖² subject to the constraints, for all (xᵢ, yᵢ):
  wᵀxᵢ + b ≥ +1 if yᵢ = +1
  wᵀxᵢ + b ≤ −1 if yᵢ = −1
Large Margin Linear Classifier
• Formulation:
  minimize ½‖w‖²
  such that yᵢ(wᵀxᵢ + b) ≥ 1
(Figure: the separator wᵀx + b = 0 with margin boundaries wᵀx + b = ±1 passing through the support vectors x⁺ and x⁻.)
Solving the Optimization Problem
  minimize ½‖w‖²
  s.t. yᵢ(wᵀxᵢ + b) ≥ 1
• An optimization problem with a convex quadratic objective and linear constraints
• Can be solved using QP.
• Lagrange duality gives the optimization problem's dual form, which
  – allows us to use kernels to get optimal margin classifiers to work efficiently in very high dimensional spaces, and
  – allows us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.
Support Vector Machine
Nonlinear SVM and Kernel
function
Non-linear decision surface
• We saw how to deal with datasets which are linearly separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map the data to a high dimensional space where it is linearly separable.
  – Won't using a bigger set of features make the computation slow?
  – The "kernel" trick makes the computation fast.
Non-linear SVMs: Feature Space
Φ : x → φ(x)
(This slide is from www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)
Kernel
• The original input attributes x are mapped to a new set of input features via a feature mapping φ.
• Since the algorithm can be written in terms of the scalar product ⟨x, z⟩, we replace it with ⟨φ(x), φ(z)⟩.
• For certain φ's there is a simple operation on two vectors in the low-dimensional space that can be used to compute the scalar product of their two images in the high-dimensional space.
• Let the kernel do the work rather than compute the scalar product in the high dimensional space.
Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function is now:
  g(x) = wᵀφ(x) + b = Σ_{i∈SV} αᵢ φ(xᵢ)ᵀφ(x) + b
• We only use the dot product of feature vectors in both training and test.
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
  K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
The kernel trick
• K(xᵢ, xⱼ) may be very inexpensive to compute even if φ(x) may be extremely high dimensional.
Kernel Example
2-dimensional vectors x = [x1 x2];
let K(xi, xj) = (1 + xiᵀxj)².
We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
  = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
  = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2] · [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
  = φ(xi)ᵀφ(xj),
where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
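A quick numeric check of this identity (an illustrative sketch; the test vectors are arbitrary):

```python
# Verify that the polynomial kernel K(x, z) = (1 + x.z)^2 equals
# phi(x).phi(z) for the 6-dimensional feature map above.
import numpy as np

def K(x, z):
    return (1.0 + x @ z) ** 2

def phi(x):
    s = np.sqrt(2.0)
    return np.array([1.0, x[0]**2, s * x[0] * x[1],
                     x[1]**2, s * x[0], s * x[1]])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(K(x, z), phi(x) @ phi(z))   # the two numbers agree
```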
Commonly-used kernel functions
• Linear kernel: K(xᵢ, xⱼ) = xᵢᵀxⱼ
• Polynomial of power p: K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)^p
• Gaussian (radial-basis function): K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²)
• Sigmoid: K(xᵢ, xⱼ) = tanh(β₀ xᵢᵀxⱼ + β₁)
In general, functions that satisfy Mercer's condition can be kernel functions.
Kernel Functions
• A kernel function can be thought of as a similarity measure between the input objects.
• Not all similarity measures can be used as kernel functions.
• Mercer's condition states that any positive semi-definite kernel K(x, y), i.e., one for which
  ΣᵢΣⱼ K(xᵢ, xⱼ) cᵢ cⱼ ≥ 0
  for all finite sets of points xᵢ and real numbers cᵢ, can be expressed as a dot product in a high dimensional space.
SVM examples
(Figures © Eric Xing @ CMU, 2006-2010)
Examples for Non-Linear SVMs – Gaussian Kernel
(Figures © Eric Xing @ CMU, 2006-2010)
Nonlinear SVM: Optimization
• Formulation (Lagrangian dual problem):
  maximize Σᵢ αᵢ − ½ ΣᵢΣⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ)
  such that 0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0
• The solution of the discriminant function is
  g(x) = Σ_{i∈SV} αᵢ yᵢ K(xᵢ, x) + b
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• The kernel trick can also be used to do PCA in a much
higher-dimensional space, thus giving a non-linear version of
PCA in the original space.
Multi-class classification
• SVMs can only handle two-class outputs
• Learn N SVMs
– SVM 1 learns Class1 vs REST
– SVM 2 learns Class2 vs REST
– :
– SVM N learns ClassN vs REST
• Then to predict the output for a new input, just
predict with each SVM and find out which one puts
the prediction the furthest into the positive region.
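A minimal sketch of this one-vs-rest decision rule, assuming the N binary separators are already trained and represented as (w, b) pairs (an illustrative helper, not from the slides):

```python
# One-vs-rest prediction: pick the class whose decision value
# f_c(x) = w_c . x + b_c is furthest into the positive region.
import numpy as np

def ovr_predict(x, models):
    """models: list of (w, b) pairs, one binary separator per class."""
    scores = [w @ x + b for (w, b) in models]
    return int(np.argmax(scores))  # index of the most confident class
```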
Introduction to Neural Network
Introduction
• Inspired by the human brain.
• Some NNs are models of biological neural networks
• Human brain contains a massively interconnected
net of 10^10–10^11 (on the order of 10 billion) neurons (cortical cells)
– Massive parallelism – large number of simple
processing units
– Connectionism – highly interconnected
– Associative distributed memory
• Pattern and strength of synaptic connections
Neuron

Neural Unit
ANNs
• ANNs incorporate the two fundamental components of
biological neural nets:
1. Nodes – Neurons
2. Weights - Synapses
Perceptrons
• Basic unit in a neural network: Linear separator
  – N inputs, x1 … xn
  – Weights for each input, w1 … wn
  – A bias input x0 (constant) and associated weight w0
  – Weighted sum of inputs, y = Σᵢ wᵢxᵢ
  – A threshold function φ, i.e., output 1 if y > 0, output 0 if y ≤ 0
(Figure: inputs x0 … xn with weights w0 … wn feed a summation unit Σ followed by the threshold φ.)
Perceptron training rule
Updates perceptron weights for a training example (x, t) as follows:
  wᵢ ← wᵢ + η (t − o) xᵢ
where t is the target output, o is the perceptron output, and η is the learning rate.
• If the data is linearly separable and η is sufficiently small, the rule will converge to a hypothesis that classifies all training data correctly in a finite number of iterations.
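A minimal numpy sketch of this rule, assuming labels t ∈ {0, 1} and a prepended bias input; names, learning rate, and epoch count are illustrative:

```python
# Perceptron training loop implementing w_i <- w_i + eta*(t - o)*x_i.
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50):
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            o = 1 if w @ xi > 0 else 0   # threshold unit output
            w += eta * (ti - o) * xi     # perceptron training rule
    return w
```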
Gradient Descent
Linear neurons
• The neuron has a real-valued output which is a weighted sum of its inputs: y = Σᵢ wᵢxᵢ
• Define the error as the squared residuals summed over all training cases:
  E = ½ Σₙ (tⁿ − yⁿ)²
• Differentiate to get error derivatives for the weights:
  ∂E/∂wᵢ = −Σₙ xᵢⁿ (tⁿ − yⁿ)
• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:
  Δwᵢ = −η ∂E/∂wᵢ
Error Surface
• The error surface lies in a space with a horizontal axis for each
weight and one vertical axis for the error.
– For a linear neuron, it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
Batch and Stochastic Learning
• Batch Learning: steepest descent on the error surface, using the gradient summed over all training cases.
• Stochastic/Online Learning: for each example, compute the gradient and update immediately.
Computation at Units
• Compute a 0-1 or a graded function of the weighted sum of the inputs:
  a = φ(w·x), where w·x = Σᵢ wᵢxᵢ
• φ() is the activation function
Neuron Model: Logistic Unit
  φ(z) = 1/(1 + e^(−z)) = 1/(1 + e^(−w·x))
Training Rule (gradient descent on the squared error, using φ′(z) = φ(z)(1 − φ(z))):
  Δwᵢ = η Σₙ (tⁿ − oⁿ) oⁿ (1 − oⁿ) xᵢⁿ
Multi-layer Neural Network
Limitations of Perceptrons
• Perceptrons have a monotonicity property:
If a link has positive weight, activation can only increase as the
corresponding input value increases (irrespective of other
input values)
• Can’t represent functions where input interactions can cancel
one another’s effect (e.g. XOR)
• Can represent only linearly separable functions
A solution: multiple layers
(Figure: a network with input layer (x1, x2), hidden layer (z1, z2), and output layer (y), together with the decision regions computed at each layer.)
Power/Expressiveness of Multilayer
Networks
• Can represent interactions among inputs
• Two layer networks can represent any Boolean
function, and continuous functions (within a
tolerance) as long as the number of hidden units is
sufficient and appropriate activation functions used
• Learning algorithms exist, but weaker guarantees
than perceptron learning algorithms
Multilayer Network
(Figure: inputs feed into a first hidden layer, then a second hidden layer, then the output layer.)
Two-layer back-propagation neural network
(Figure: input signals x1 … xn enter the input layer; weights wij connect input unit i to hidden unit j; weights wjk connect hidden unit j to output unit k, producing outputs y1 … yn2. Error signals propagate backwards through the same connections.)
The back-propagation training algorithm
• Step 1: Initialisation
Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small range
1

v01
v11 1
x1 1 1 w11
v21 w01

1 y1
v22
x2 2 2 w21
v22
Input v02 Output

1
x z y
Backprop
• Initialization
  – Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range
• Forward computing:
  – Apply an input vector x to the input units
  – Compute the activation/output vector z on the hidden layer:
    zⱼ = φ(Σᵢ vᵢⱼ xᵢ)
  – Compute the output vector y on the output layer:
    yₖ = φ(Σⱼ wⱼₖ zⱼ)
  y is the result of the computation.
Learning for BP Nets
• Update of weights in W (between output and hidden layers):
– delta rule
• Not applicable to updating V (between input and hidden)
– we don't know the target values for hidden units z1, z2, …, zp
• Solution: Propagate errors at output units to hidden units to
drive the update of weights in V (again by delta rule)
(error BACKPROPAGATION learning)
• Error backpropagation can be continued downward if the net
has more than one hidden layer.
• How to compute errors on hidden units?
Derivation
• For one output neuron, the error function is
  E = ½ (t − y)²
• For each unit j, the output oⱼ is defined as
  oⱼ = φ(netⱼ),  netⱼ = Σₖ wₖⱼ oₖ
  The input netⱼ to a neuron is the weighted sum of the outputs of the previous neurons.
• Finding the derivative of the error:
  ∂E/∂wᵢⱼ = (∂E/∂oⱼ)(∂oⱼ/∂netⱼ)(∂netⱼ/∂wᵢⱼ)
• Consider E as a function of the inputs of all neurons ℓ receiving input from neuron j; taking the total derivative with respect to oⱼ, a recursive expression for the derivative is obtained:
  ∂E/∂oⱼ = Σℓ (∂E/∂netℓ)(∂netℓ/∂oⱼ) = Σℓ δℓ wⱼℓ
• Therefore, the derivative with respect to oⱼ can be calculated if all the derivatives with respect to the outputs of the next layer – the one closer to the output neuron – are known.
• Putting it all together:
  ∂E/∂wᵢⱼ = δⱼ oᵢ
  with δⱼ = (oⱼ − tⱼ) φ′(netⱼ) for an output neuron, and δⱼ = (Σℓ δℓ wⱼℓ) φ′(netⱼ) for a hidden neuron.
• To update the weight using gradient descent, one must choose a learning rate η:
  Δwᵢⱼ = −η ∂E/∂wᵢⱼ
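A compact numpy sketch of one gradient step implied by this derivation, for a single hidden layer of sigmoid units (shapes and names are illustrative assumptions):

```python
# One backpropagation step: forward pass, output deltas,
# backpropagated hidden deltas, gradient-descent updates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, V, W, eta=0.5):
    """x: (n,) input, t: (k,) target; V: (n, h), W: (h, k) weights."""
    z = sigmoid(x @ V)                           # hidden activations
    y = sigmoid(z @ W)                           # network outputs
    delta_out = (y - t) * y * (1 - y)            # dE/dnet at output units
    delta_hid = (delta_out @ W.T) * z * (1 - z)  # recursion to hidden units
    W -= eta * np.outer(z, delta_out)            # dE/dW[j,k] = z_j * delta_k
    V -= eta * np.outer(x, delta_hid)            # dE/dV[i,j] = x_i * delta_j
    return 0.5 * np.sum((t - y) ** 2)            # squared error, for monitoring
```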
Neural Network and Backpropagation
Algorithm
Single layer Perceptron
• Single layer perceptrons learn linear decision boundaries.
(Figures: a linearly separable dataset, x: class I (y = 1) vs. o: class II (y = −1), split by a line; and the XOR pattern, where no single line separates the two classes.)
Boolean OR
x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   1
Realized by a single unit with bias weight w0 = −0.5 and weights w1 = 1, w2 = 1.
Boolean AND
x1  x2  output
0   0   0
0   1   0
1   0   0
1   1   1
Realized by a single unit with bias weight w0 = −1.5 and weights w1 = 1, w2 = 1.
Boolean XOR
x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   0
Not linearly separable: no single line separates the positives from the negatives.
Boolean XOR with a hidden layer
A two-layer network computes XOR: hidden unit h1 = OR(x1, x2) (bias −0.5, weights 1, 1), hidden unit h2 = AND(x1, x2) (bias −1.5, weights 1, 1), and output unit o (bias −0.5) with weight +1 from h1 and weight −1 from h2.
Representation Capability of NNs
• Single layer nets have limited representation power (linear
separability problem). Multi-layer nets (or nets with non-
linear hidden units) may overcome linear inseparability
problem.
• Every Boolean function can be represented by a network
with a single hidden layer.
• Every bounded continuous function can be approximated
with arbitrarily small error, by a network with one hidden layer
• Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
– For each training example, do
  • Input the training example to the network and compute the network outputs
  • For each output unit k (tₖ = target output, oₖ = observed output):
    δₖ ← oₖ(1 − oₖ)(tₖ − oₖ)
  • For each hidden unit h:
    δₕ ← oₕ(1 − oₕ) Σₖ wₕₖ δₖ
  • Update each network weight (xᵢⱼ = input from unit i to unit j, wᵢⱼ = weight from i to j):
    wᵢⱼ ← wᵢⱼ + Δwᵢⱼ, where Δwᵢⱼ = η δⱼ xᵢⱼ
Backpropagation
• Gradient descent over the entire network weight vector
• Can be generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
• May include weight momentum:
  Δwᵢⱼ(n) = η δⱼ xᵢⱼ + α Δwᵢⱼ(n − 1)
• Training may be slow.
• Using the network after training is very fast.
Training practices: batch vs. stochastic vs. mini-batch gradient descent
• Batch gradient descent:
  1. Calculate outputs for the entire dataset
  2. Accumulate the errors, back-propagate and update
  – Too slow to converge; gets stuck in local minima
• Stochastic/online gradient descent:
  1. Feed forward a training example
  2. Back-propagate the error and update the parameters
  – Converges to the solution faster; often helps get the system out of local minima
• Mini-batch gradient descent: update on small batches; learning proceeds in epochs
Stopping
• Train the NN on the entire training set over and over
again
• Each such episode of training is called an “epoch”

Stopping
1. Fixed maximum number of epochs: most naïve
2. Keep track of the training and validation error
curves.
Overfitting in ANNs
Local Minima
• A NN can get stuck in local minima for small networks.
• For most large networks (many weights), local minima rarely occur.
• It is unlikely that you are at a minimum in every dimension simultaneously.
ANN
• Highly expressive non-linear functions
• Highly parallel network of logistic function units
• Minimizes sum of squared training errors
• Can add a regularization term (weight squared)
• Local minima
• Overfitting
Deep Neural Network
Deep Learning
• Breakthrough results in
– Image classification
– Speech Recognition
– Machine Translation
– Multi-modal learning
Deep Neural Network
• Problem: training networks with many hidden layers
doesn’t work very well
• Local minima, very slow training if initialize with zero
weights.
• Diffusion of gradient.
Hierarchical Representation
• Hierarchical representations help represent complex functions.
• NLP: character -> word -> chunk -> clause -> sentence
• Image: pixel -> edge -> texton -> motif -> part -> object
• Deep Learning: learning a hierarchy of internal representations
• Learned internal representations at the hidden layers (trainable feature extractors)
• Feature learning
(Pipeline: Input → Trainable Feature Extractor → … → Trainable Feature Extractor → Trainable Classifier → Output)
Unsupervised Pre-training
• We will use greedy, layer-wise pre-training:
  – Train one layer at a time
  – Fix the parameters of previous hidden layers
  – Previous layers are viewed as feature extraction
  – Find hidden unit features that are more common in training input than in random inputs
Tuning the Classifier
• After pre-training of the layers
– Add output layer
– Train the whole network using
supervised learning (Back propagation)
Deep neural network
• Feed forward NN
• Stacked Autoencoders (multilayer neural net
with target output = input)
• Stacked restricted Boltzmann machine
• Convolutional Neural Network
A Deep Architecture: Multi-Layer Perceptron
• Output layer y: here predicting a supervised target
• Hidden layers h3, h2, h1: these learn more and more abstract representations as you head up
• Input layer x: raw sensory inputs
A Neural Network
• Training: Back Propagation of Error
  – Calculate the total error at the top
  – Calculate contributions to the error at each step going backwards (output layer → hidden layer → input layer)
  – The weights are modified as the error is propagated
Training Deep Networks
• Difficulties of supervised training of deep networks
1. Early layers of MLP do not get trained well
• Diffusion of Gradient – error attenuates as it propagates to
earlier layers
• Leads to very slow training
• the error to earlier layers drops quickly as the top layers "mostly"
solve the task
2. Often not enough labeled data available while there may be
lots of unlabeled data
3. Deep networks tend to have more local minima problems
than shallow networks during supervised training
Training of neural networks
• Forward Propagation:
  – Sum the inputs, produce an activation at each unit
  – feed-forward from the input layer through the hidden layers to the output layer
  – The activation function supplies the non-linearity
(Figure: a feed-forward network with input, hidden, and output layers, plus examples of activation functions.)
Activation Functions
• tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ)
• sigmoid(x) = 1/(1 + e⁻ˣ)
• Rectified linear: relu(x) = max(0, x)
  – Simplifies backprop
  – Makes learning faster
  – Makes features sparse
  → Preferred option
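Minimal numpy sketches of the three activations above:

```python
import numpy as np

def tanh(x):
    # The formula above; numpy also provides np.tanh directly.
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)   # max(0, x), elementwise
```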
Autoencoder
• Unlabeled training examples set {x⁽¹⁾, x⁽²⁾, …}
• Set the target values to be equal to the inputs: y⁽ⁱ⁾ = x⁽ⁱ⁾
• The network is trained to output the input (learn the identity function).
• The hidden-layer activations a1, a2, a3, … form the learned encoding.
• The solution may be trivial!
Autoencoders and sparsity
1. Place constraints on the
network, like limiting the
number of hidden units, to
discover interesting structure
about the data.
2. Impose sparsity constraint.
a neuron is “active” if its output
value is close to 1
It is “inactive” if its output value is
close to 0.
constrain the neurons to be inactive
most of the time.
Auto-Encoders
Stacked Auto-Encoders
• Do supervised training on the last layer using the final features, e.g., with a softmax output layer:
  yᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
• Then do supervised training on the entire network to fine-tune all weights
Convolutional Neural Networks
• A CNN consists of a number of convolutional and
subsampling layers.
• Input to a convolutional layer is a m x m x r image
where m x m is the height and width of the image and r is
the number of channels, e.g. an RGB image has r=3
• Convolutional layer will have k filters (or kernels)
• size n x n x q
• n is smaller than the dimension of the image and,
• q can either be the same as the number of channels r or
smaller and may vary for each kernel
Convolutional Neural Networks
• Convolutional layers consist of a rectangular grid of neurons.
• Each neuron takes inputs from a rectangular section of the previous layer.
• The weights for this rectangular section are the same for each neuron in the convolutional layer.
Pooling: Using features obtained after Convolution for Classification
The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each block to produce a single output from that block: max, average, etc.
CNN properties
• A CNN takes advantage of the sub-structure of the input.
• This is achieved with local connections and tied weights, followed by some form of pooling, which results in translation-invariant features.
• CNNs are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.
Recurrent Neural Network (RNN)
Foundations of Machine Learning
Module 7: Computational
Learning Theory
Part A: Finite Hypothesis Space
Sudeshna Sarkar
IIT Kharagpur
Goal of Learning Theory
• To understand
  – What kinds of tasks are learnable?
  – What kind of data is required for learnability?
  – What are the (space, time) requirements of the learning algorithm?
• To develop and analyze models
  – Develop algorithms that provably meet desired criteria
  – Prove guarantees for successful algorithms
Goal of Learning Theory
• Two core aspects of ML
  – Algorithm Design. How to optimize?
  – Confidence for rule effectiveness on future data.
• We need particular settings (models)
  – Probably Approximately Correct (PAC)
(Figure: target concept c and hypothesis h as regions of the instance space; the error region is where c and h disagree.)
Prototypical Concept Learning Task
• Given
  – Instances X (the instance space)
  – Distribution D over X
  – Target function c
  – Hypothesis space H
  – Training examples S = {(xᵢ, c(xᵢ))}, drawn i.i.d. from D
• Determine
  – A hypothesis h ∈ H s.t. h(x) = c(x) for all x in S?
  – A hypothesis h ∈ H s.t. h(x) = c(x) for all x in X?
• An algorithm does optimization over S and finds a hypothesis h.
• Goal: Find h which has small error over D.
Computational Learning Theory
• Can we be certain about how the learning algorithm generalizes?
• We would have to see all the examples.
• Inductive inference – generalizing beyond the training data is impossible unless we add more assumptions (e.g., priors over H).
• We need a bias!
Function Approximation
• How many labeled examples are needed in order to determine which of the hypotheses is the correct one?
• All instances in X must be labeled!
• Inductive inference: generalizing beyond the training data is impossible unless we add more assumptions (e.g., bias).
(Figure: two hypotheses h1, h2 that agree on the labeled examples but disagree elsewhere in the instance space X.)
Error of a hypothesis
The true error of hypothesis h, with respect to the target concept c and observation distribution D, is the probability that h will misclassify an instance drawn according to D:
  error_D(h) = Pr_{x~D}[h(x) ≠ c(x)]
In a perfect world, we'd like the true error to be 0.
Bias: fix a hypothesis space H.
c may not be in H => find h close to c.
A hypothesis h is approximately correct if error_D(h) ≤ ε.
PAC model
• Goal: h has small error over D:
  error_D(h) = Pr_{x~D}[h(x) ≠ c(x)]
  – how often h errs over future instances drawn at random from D.
• But we can only measure the training error:
  error_S(h) = (1/m) Σᵢ [h(xᵢ) ≠ c(xᵢ)]
  – how often h errs over the training instances.
• Sample Complexity: bound error_D(h) in terms of error_S(h).
Probably Approximately Correct Learning
• PAC Learning concerns efficient learning.
• We would like to prove that, with high probability, an (efficient) learning algorithm will find a hypothesis that is approximately identical to the hidden target concept.
• We specify two parameters, ε and δ, and require that with probability at least (1 − δ) the system learn a concept with error at most ε.
Sample Complexity for Supervised Learning
Theorem
  m ≥ (1/ε)(ln|H| + ln(1/δ))
labeled examples are sufficient so that with probability ≥ 1 − δ, every h ∈ H that is consistent with the data has error_D(h) < ε.
• inversely linear in ε
• logarithmic in |H|
• ε error parameter: D might place low weight on certain parts of the space
• δ confidence parameter: there is a small chance the examples we get are not representative of the distribution
Sample Complexity for Supervised Learning
Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that with probability ≥ 1 − δ, all consistent h ∈ H have error_D(h) < ε.
Proof: Assume k bad hypotheses H_bad = {h : error_D(h) ≥ ε}.
• Fix a bad hᵢ. The probability that hᵢ is consistent with the first training example is ≤ 1 − ε; the probability that it is consistent with the first m training examples is ≤ (1 − ε)^m.
• The probability that at least one bad hypothesis is consistent with the first m training examples is ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m.
• Calculate the value of m so that |H|(1 − ε)^m ≤ δ.
• Use the fact that (1 − ε) ≤ e^(−ε); it is sufficient to set m ≥ (1/ε)(ln|H| + ln(1/δ)).
Sample Complexity: Finite Hypothesis Spaces, Realizable Case
PAC: How many examples suffice to guarantee small error with high probability?
Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that with probability ≥ 1 − δ, all consistent h ∈ H have error_D(h) < ε.
Statistical Learning Way: with probability at least 1 − δ, all h ∈ H s.t. error_S(h) = 0 have error_D(h) ≤ (1/m)(ln|H| + ln(1/δ)).
Derivation:
  P(consist(H_bad, D)) ≤ |H| e^(−εm) ≤ δ
  e^(−εm) ≤ δ/|H|
  −εm ≤ ln(δ/|H|)
  m ≥ ln(|H|/δ)/ε    (flip the inequality)
  m ≥ (1/ε)(ln(1/δ) + ln|H|)
Sample complexity: inconsistent finite |H|
• For a single hypothesis to have misleading training error, the Hoeffding bound gives
  P(error_D(h) > error_S(h) + ε) ≤ e^(−2mε²)
• We want to ensure that the best hypothesis has error bounded in this way
  – So consider that any one of them could have a large error:
    P(∃h ∈ H: error_D(h) > error_S(h) + ε) ≤ |H| e^(−2mε²)
• From this we can derive the bound for the number of samples needed.
Sample Complexity: Finite Hypothesis Spaces
Consistent Case
Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that with probability ≥ 1 − δ, all h ∈ H with error_S(h) = 0 have error_D(h) ≤ ε.
Inconsistent Case
What if there is no perfect h?
Theorem: After m examples, with probability ≥ 1 − δ, all h ∈ H have |error_D(h) − error_S(h)| ≤ ε, for
  m ≥ (1/2ε²)(ln|H| + ln(2/δ))
Sample complexity: example
• H: conjunctions of n Boolean literals. Is H PAC-learnable?
  |H| = 3ⁿ, so m ≥ (1/ε)(n ln 3 + ln(1/δ)).
• Concrete examples:
  – δ = ε = 0.05, n = 10 gives 280 examples
  – δ = 0.01, ε = 0.05, n = 10 gives 312 examples
  – δ = ε = 0.01, n = 10 gives 1,560 examples
  – δ = ε = 0.01, n = 50 gives 5,954 examples
• The result holds for any consistent learner, such as Find-S.
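A small script reproducing these numbers from the bound m ≥ (1/ε)(n ln 3 + ln(1/δ)):

```python
import math

def pac_bound(n, eps, delta):
    # Sufficient sample size for conjunctions of n Boolean literals,
    # where |H| = 3^n so ln|H| = n ln 3.
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

print(pac_bound(10, 0.05, 0.05))  # 280
print(pac_bound(10, 0.05, 0.01))  # 312
print(pac_bound(10, 0.01, 0.01))  # 1560
print(pac_bound(50, 0.01, 0.01))  # 5954
```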
Sample Complexity of Learning Arbitrary Boolean Functions
• Consider any Boolean function over n Boolean features, such as the hypothesis space of DNF or decision trees. There are 2^(2ⁿ) of these, so a sufficient number of examples to learn a PAC concept is:
  m ≥ (1/ε)(2ⁿ ln 2 + ln(1/δ))
• δ = ε = 0.05, n = 10 gives 14,256 examples
• δ = ε = 0.05, n = 20 gives 14,536,410 examples
• δ = ε = 0.05, n = 50 gives 1.561 × 10¹⁶ examples
Thank You
Concept Learning Task
“Days in which Aldo enjoys swimming”
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

• Hypothesis Representation: Conjunction of constraints on the 6 instance attributes
• "?": any value is acceptable
• a specified value: a single required value for the attribute
• "∅": no value is acceptable
Concept Learning

h = (?, Cold, High, ?, ?, ?)


indicates that Aldo enjoys his favorite sport on
cold days with high humidity
Most general hypothesis: (?, ?, ?, ?, ?, ?)
Most specific hypothesis: (∅, ∅, ∅, ∅, ∅, ∅)
Find-S Algorithm
1. Initialize h to the most specific hypothesis in
2. For each positive training instance x
For each attribute constraint ai in h
IF the constraint ai in h is satisfied by x
THEN do nothing
ELSE replace ai in h by next more general
constraint satisfied by x
3. Output hypothesis h
Concept Learning
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

Finding a Maximally Specific Hypothesis


Find-S Algorithm trace:
h1 ← (∅, ∅, ∅, ∅, ∅, ∅)
h2 ← (Sunny, Warm, Normal, Strong, Warm, Same)
h3 ← (Sunny, Warm, ?, Strong, Warm, Same)
h4 ← (Sunny, Warm, ?, Strong, ?, ?)
Thank You
Foundations of Machine Learning
Module 7: Computational
Learning Theory
Part B
Sudeshna Sarkar
IIT Kharagpur
Sample Complexity: Infinite Hypothesis
Spaces
• Need some measure of the expressiveness of infinite
hypothesis spaces.
• The Vapnik-Chervonenkis (VC) dimension provides
such a measure, denoted VC(H).
• Analogous to ln|H|, there are bounds for sample
complexity using VC(H).
Shattering
• Consider a hypothesis space H for the 2-class problem.
• A set of m points (instances) can be labeled as + or − in 2^m ways.
• If for every such labeling a function can be found in H consistent with this labeling, we say that the set of instances is shattered by H.
Three points in R2
• It is enough to find one set of three points that can be
shattered.
• It is not necessary to be able to shatter every possible set of
three points in 2 dimensions
Shattering Instances
• Consider 2 instances x, y described using a single real-valued feature, being shattered by a single interval.
Shattering Instances (cont)
But 3 instances x < y < z cannot be shattered by a single interval: of the 8 possible labelings of {x, y, z}, the labeling that marks x and z positive but y negative cannot be realized, since any interval containing x and z must also contain y.
VC Dimension
• The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered, then VC(H) = ∞.
• If there exists at least one subset of X of size d that can be shattered, then VC(H) ≥ d.
• If no subset of size d can be shattered, then VC(H) < d.
• For single intervals on the real line, all sets of 2 instances can be shattered, but no set of 3 instances can, so VC(H) = 2.
VC Dimension
• An unbiased hypothesis space shatters the entire instance space.
• The larger the subset of X that can be shattered, the more expressive (and less biased) the hypothesis space is.
• The VC dimension of the set of oriented lines in 2-d is three.
• Since there are 2^m labelings of m instances, in order for H to shatter m instances we need |H| ≥ 2^m.
• Since |H| ≥ 2^m is required to shatter m instances, VC(H) ≤ log₂|H|.
VC Dimension Example
Consider axis-parallel rectangles in the real-plane,
i.e. conjunctions of intervals on two real-valued
features. Some 4 instances can be shattered.

Some 4 instances cannot be shattered:


VC Dimension Example (cont)
• No five instances can be shattered since there can be at most
4 distinct extreme points (min and max on each of the 2
dimensions) and these 4 cannot be included without including
any possible 5th point.

• Therefore VC(H) = 4
• Generalizes to axis-parallel hyper-rectangles (conjunctions of
intervals in n dimensions): VC(H)=2n.
Upper Bound on Sample Complexity with VC
• Using VC dimension as a measure of expressiveness, the following number of examples has been shown to be sufficient for PAC learning (Blumer et al., 1989):
  m ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
• Compared to the previous result using ln|H|, this bound has some extra constants and an extra log₂(1/ε) factor. Since VC(H) ≤ log₂|H|, this can provide a tighter upper bound on the number of examples needed for PAC learning.
Sample Complexity Lower Bound with VC
• There is also a general lower bound on the minimum number of examples necessary for PAC learning (Ehrenfeucht et al., 1989): Consider any concept class C with VC(C) ≥ 2, any learner L, and any sufficiently small ε and δ. Then there exists a distribution D and a target concept in C such that if L observes fewer than
  max[ (1/ε) log₂(1/δ), (VC(C) − 1)/(32ε) ]
examples, then with probability at least δ, L outputs a hypothesis having error greater than ε.
• Ignoring constant factors, this lower bound is the same as the upper bound except for the extra log₂(1/ε) factor in the upper bound.
Foundations of Machine Learning

Module 8: Ensemble Learning


Part A

Sudeshna Sarkar
IIT Kharagpur
What is Ensemble Classification?
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
Why should it work?
• Works well only if the individual classifiers
disagree
– Error rate < 0.5 and errors are independent
– Error rate is highly correlated with the correlations
of the errors made by the different learners
Bias vs. Variance
• We would like low bias error and low variance error
• Ensembles using multiple trained (high variance/low
bias) models can average out the variance, leaving
just the bias
– Less worry about overfit (stopping criteria, etc.)
with the base models
Combining Weak Learners
• Combining weak learners
  – Assume n independent models, each having an accuracy of 70%.
  – If all n give the same class output, then you can be confident it is correct with probability 1 − (1 − 0.7)ⁿ.
  – Normally they are not completely independent, but it is unlikely that all n would give the same output.
• Accuracy is better than the base accuracy of the models when using the majority output:
  – If n1 models say class 1 and n2 < n1 models say class 2, then
    P(class1) = 1 − Binomial(n, n2, 0.7)
Ensemble Creation Approaches
• Get less correlated errors between models
– Injecting randomness
• initial weights (eg, NN), different learning parameters,
different splits (eg, DT) etc.
– Different Training sets
• Bagging, Boosting, different features, etc.
– Forcing differences
• different objective functions
– Different machine learning model
Ensemble Combining Approaches
• Unweighted Voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting),
Expertise, etc.
• Stacking - Learn the combination function
Combine Learners: Voting
• Unweighted voting
• Linear combination (weighted vote):
  y = Σⱼ wⱼ dⱼ, with wⱼ ≥ 0 and Σⱼ wⱼ = 1
  – weights can be set proportional to accuracy
• Bayesian combination: weight each model by its posterior probability
Fixed Combination Rules
Bayes Optimal Classifier
• The Bayes Optimal Classifier is an ensemble of all the hypotheses in the hypothesis space:
  y = argmax_{cⱼ∈C} Σ_{hᵢ∈H} P(cⱼ|hᵢ) P(T|hᵢ) P(hᵢ)
• On average, no other ensemble can outperform it.
• The vote for each hypothesis is
  – proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true, and
  – multiplied by the prior probability of that hypothesis.
• y is the predicted class, C is the set of all possible classes, H is the hypothesis space, and T is the training data.
• The Bayes Optimal Classifier represents a hypothesis that is not necessarily in H, but it is the optimal hypothesis in the ensemble space.
Practicality of Bayes Optimal Classifier
• Cannot be practically implemented.
• Most hypothesis spaces are too large
• Many hypotheses output a class or a value, and not
probability
• Estimating the prior probability for each hypothesis
is not always possible.
BMA
• All possible models in the model space used
weighted by their probability of being the “Correct”
model
• Optimal given the correct model space and priors
Why are Ensembles Successful?
• Bayesian perspective:
  P(Cᵢ|x) = Σⱼ P(Cᵢ|x, Mⱼ) P(Mⱼ), summed over all models Mⱼ
• If the dⱼ are independent:
  Var(y) = Var((1/L) Σⱼ dⱼ) = (1/L²) Var(Σⱼ dⱼ) = (1/L²) · L · Var(dⱼ) = (1/L) Var(dⱼ)
  Bias does not change; variance decreases by a factor of L.
• If dependent, error increases with positive correlation:
  Var(y) = (1/L²) Var(Σⱼ dⱼ) = (1/L²) [Σⱼ Var(dⱼ) + 2 Σⱼ Σ_{i<j} Cov(dᵢ, dⱼ)]
Challenge for developing Ensemble Models
• The main challenge is to obtain base models which are independent and make independent kinds of errors.
• Independence between two base classifiers can be assessed in this case by measuring the degree of overlap in the training examples they misclassify (|A∩B|/|A∪B|).
Thank You
Foundations of Machine Learning

Module 8: Ensemble Learning


Part B: Bagging and Boosting

Sudeshna Sarkar
IIT Kharagpur
Bagging
• Bagging = “bootstrap aggregation”
– Draw N items from X with replacement
• Works best with learners that have high variance (unstable)
  – Decision trees and ANNs are unstable
  – k-NN is stable
• Use bootstrapping to generate L training sets and
train one base-learner with each (Breiman, 1996)
• Use voting
Bagging
• Sampling with replacement
• Build a classifier on each bootstrap sample
• Each example has probability (1 − 1/n)ⁿ of not being selected in a given bootstrap sample (≈ 0.37 for large n), so each sample contains roughly 63% of the distinct examples.
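A minimal bagging sketch; `fit_tree` stands in for any unstable base learner and is an assumed helper, not part of the slides:

```python
# Bagging = bootstrap aggregation: L bootstrap samples, one base
# learner per sample, unweighted majority vote at prediction time.
import numpy as np

def bagging_fit(X, y, fit_tree, L=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)   # draw n items with replacement
        models.append(fit_tree(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m(x) for m in models]         # each model is a callable classifier
    return max(set(votes), key=votes.count)  # unweighted majority vote
```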
Boosting
• An iterative procedure. Adaptively change distribution of
training data.
– Initially, all N records are assigned equal weights
– Weights change at the end of boosting round
• On each iteration t:
– Weight each training example by how incorrectly it was classified
– Learn a hypothesis:
– A strength for this hypothesis:
• Final classifier:
– A linear combination of the votes of the different classifiers
weighted by their strength
• “weak” learners
– P(correct) > 50%, but not necessarily much better
Adaboost
• Boosting can turn a weak algorithm into a strong learner.
• Input: S = {(x₁, y₁), …, (xₘ, yₘ)}
• Dₜ(i): weight of the i-th training example at round t
• Weak learner A
• For t = 1, …, T:
  – Construct distribution Dₜ on {1, …, m}
  – Run A on Dₜ, producing weak hypothesis hₜ
  – εₜ = error of hₜ over Dₜ
Given: (x₁, y₁), …, (xₘ, yₘ) where xᵢ ∈ X, yᵢ ∈ {−1, +1}
Initialize D₁(i) = 1/m.
For t = 1, …, T:
– Train the weak learner using distribution Dₜ.
– Get weak classifier hₜ : X → {−1, +1}
– Choose αₜ = ½ ln((1 − εₜ)/εₜ) to minimize training error, where εₜ = Pr_{i~Dₜ}[hₜ(xᵢ) ≠ yᵢ]
– Update:
  Dₜ₊₁(i) = Dₜ(i) exp(−αₜ yᵢ hₜ(xᵢ)) / Zₜ
  where Zₜ is a normalization factor
Output the final classifier:
  H(x) = sign(Σₜ αₜ hₜ(x))
Strong from weak classifiers
• If each classifier is (at least slightly) better than random: εₜ < ½
• It can be shown that AdaBoost will achieve zero training error (exponentially fast):
  training error of H ≤ exp(−2 Σₜ γₜ²), where γₜ = ½ − εₜ
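A sketch of the AdaBoost loop above with labels in {−1, +1}; `stump_fit` is an assumed helper that returns a weak classifier minimizing the weighted error:

```python
# AdaBoost: reweight examples each round, combine weak classifiers
# with weights alpha_t = 0.5 * ln((1 - eps_t)/eps_t).
import numpy as np

def adaboost(X, y, stump_fit, T=10):
    m = len(X)
    D = np.full(m, 1.0 / m)                 # initial distribution D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        h = stump_fit(X, y, D)              # train weak learner on D_t
        pred = np.array([h(x) for x in X])
        eps = max(np.sum(D[pred != y]), 1e-10)  # weighted error (clamped)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight hits
        D /= D.sum()                        # normalize (the factor Z_t)
        ensemble.append((alpha, h))
    return lambda x: np.sign(sum(a * h(x) for a, h in ensemble))
```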
Illustrating AdaBoost
(Figure: ten data points, initially each with weight 0.1. After round 1 the misclassified points get weight 0.4623 and the rest 0.0094, with α = 1.9459; round 2 gives α = 2.9323; round 3 gives α = 3.8744. The overall classifier combines the three rounds and classifies all points correctly.)
Thank You
Foundations of Machine Learning

Module 9: Clustering
Part A: Introduction and kmeans

Sudeshna Sarkar
IIT Kharagpur
Unsupervised learning
• Unsupervised learning:
– Data with no target attribute. Describe hidden structure from
unlabeled data.
– Explore the data to find some intrinsic structures in them.
• Clustering: the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more
similar to each other than to those in other clusters.
• Useful for
– Automatically organizing data.
– Understanding hidden structure in data.
– Preprocessing for further analysis.
Applications: News Clustering (Google)
Gene Expression Clustering
Other Applications
• Biology: classification of plants and animal kingdom
given their features
• Marketing: Customer Segmentation based on a
database of customer data containing their
properties and past buying records
• Clustering weblog data to discover groups of similar
access patterns.
• Recognize communities in social networks.
An illustration
• This data set has four natural clusters.
(Figure: scatter plot of the data before and after clustering.)
Aspects of clustering
• A clustering algorithm, such as
  – Partitional clustering, e.g., k-means
  – Hierarchical clustering, e.g., AHC
  – Mixture of Gaussians
• A distance or similarity function
  – such as Euclidean, Minkowski, cosine
• Clustering quality
  – Inter-cluster distance → maximized
  – Intra-cluster distance → minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of
objects using some criterion
• Model-based: Hypothesize a model for each cluster and find
best fit of models to data
• Density-based: Guided by connectivity and density functions
• Graph-Theoretic Clustering
Partitioning Algorithms
• Partitioning method: Construct a partition of a
database D of m objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic method: k-means (MacQueen, 1967)
Hierarchical Clustering
(Example taxonomy: animal → vertebrate {fish, reptile, amphibian, mammal} and invertebrate {worm, insect, crustacean}.)
• Produce a nested sequence of clusters.
• One approach: recursive application of a partitional clustering algorithm.
Model Based Clustering
• A model is hypothesized
• e,g., Assume data is
generated by a mixture of
underlying probability
distributions
• Fit the data to model
Density based Clustering
• Based on density
connected points
• Locates regions of high
density separated by
regions of low density
• e.g., DBSCAN
Graph Theoretic Clustering
• Weights of edges
between items (nodes)
based on similarity
• E.g., look for minimum
cut in a graph
(Dis)similarity measures
• Distance metric (scale-dependent)
  – Minkowski family of distance measures:
    d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)
    Manhattan (p = 1), Euclidean (p = 2)
  – Cosine distance:
    d(x, y) = 1 − (x·y)/(‖x‖‖y‖)
(Dis)similarity measures
• Correlation coefficients (scale-invariant)
• Mahalanobis distance:
  d(x, y) = √((x − y)ᵀ S⁻¹ (x − y)), where S is the covariance matrix
• Pearson correlation:
  r(x, y) = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² Σᵢ(yᵢ − ȳ)²)
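Minimal numpy sketches of these measures (S in the Mahalanobis distance is the data covariance matrix; all names are illustrative):

```python
import numpy as np

def minkowski(x, y, p=2):
    # p=1 gives Manhattan, p=2 gives Euclidean distance.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def mahalanobis(x, y, S):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]   # scale-invariant correlation
```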
Quality of Clustering
• Internal evaluation:
– assign the best score to the algorithm that produces clusters with high
similarity within a cluster and low similarity between clusters, e.g.,
Davies-Bouldin index

• External evaluation:
– evaluated based on data such as known class labels and external
benchmarks, eg, Rand Index, Jaccard Index, f-measure
Thank You
Foundations of Machine Learning

Module 9: Clustering
Part C: Hierarchical Clustering

Sudeshna Sarkar
IIT Kharagpur
Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single cluster
(i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data
points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster is
recursively divided further
– stops when only singleton clusters of individual data points
remain, i.e., each cluster with only a single point
Dendrogram: Hierarchical Clustering
– Given an input set S, nodes of the dendrogram represent subsets of S.
– Features of the tree:
  • The root is the whole input set S.
  • The leaves are the individual elements of S.
  • The internal nodes are defined as the union of their children.
(Figure: a dendrogram over six points, with merge heights 0.05–0.2 on the vertical axis.)
Dendrogram: Hierarchical Clustering
– Each level of the tree represents a partition of the input data into several (nested) clusters or groups.
– The tree may be cut at any level: each connected component then forms a cluster.
Hierarchical clustering
Hierarchical Agglomerative Clustering (HAC)
• Initially, each data point forms a cluster.
• Compute the distance matrix between the clusters.
• Repeat
  – Merge the two closest clusters
  – Update the distance matrix
• Until only a single cluster remains.
Different definitions of the distance lead to different algorithms.
Initialization
• Each individual point is taken as a cluster.
• Construct the distance/proximity matrix over the points p1, p2, p3, …
Intermediate State
• After some merging steps, we have some clusters, say C1, …, C5, and the distance/proximity matrix between them.
Intermediate State
• Merge the two closest clusters (say C2 and C5) and update the distance matrix.
After Merging
• Update the distance matrix: the distances from the merged cluster C2 ∪ C5 to each remaining cluster must be recomputed.
Closest Pair
• A few ways to measure distances of two clusters.
• Single-link
– Similarity of the most similar (single-link)
• Complete-link
– Similarity of the least similar points
• Centroid
– Clusters whose centroids (centers of gravity) are the
most similar
• Average-link
– Average cosine between pairs of elements
Distance between two clusters
• The single-link distance between clusters Cᵢ and Cⱼ is the minimum distance between any object in Cᵢ and any object in Cⱼ; in terms of similarity:
  sim(Cᵢ, Cⱼ) = max_{x∈Cᵢ, y∈Cⱼ} sim(x, y)
Single Link Example
It can result in "straggly" (long and thin) clusters due to the chaining effect.
Single-link clustering: example
• Determined by one pair of points, i.e., by one link in the proximity graph.
Similarity matrix:
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Complete link method
• The distance between two clusters is the distance between the two furthest data points in the two clusters; in terms of similarity:
  sim(cᵢ, cⱼ) = min_{x∈cᵢ, y∈cⱼ} sim(x, y)
• Makes "tighter," spherical clusters that are typically preferable.
• It is sensitive to outliers because they are far away.
Complete-link clustering: example
• The distance between clusters is determined by the two most distant points in the different clusters (same similarity matrix as above).
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
• In each of the subsequent N − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
  – Often O(N³) if done naively, or O(N² log N) if done more cleverly.
Average Link Clustering
• Similarity of two clusters = average similarity between any object in Cᵢ and any object in Cⱼ:
  sim(cᵢ, cⱼ) = (1/(|Cᵢ||Cⱼ|)) Σ_{x∈Cᵢ} Σ_{y∈Cⱼ} sim(x, y)
• A compromise between single and complete link. Less susceptible to noise and outliers.
• Two options:
  – Averaged across all ordered pairs in the merged cluster
  – Averaged over all pairs between the two original clusters
The complexity
• All the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average links can be done in O(n² log n).
• Due to the complexity, hierarchical clustering is hard to use for large data sets.
Model-based clustering
• Assume data generated from probability
distributions
• Goal: find the distribution parameters
• Algorithm: Expectation Maximization (EM)
• Output: Distribution parameters and a soft
assignment of points to clusters
Model-based clustering
• Assume k probability distributions with parameters θ = (θ₁, …, θₖ)
• Given data X, compute θ such that
  Pr(X|θ) [likelihood] or ln Pr(X|θ) [log likelihood]
  is maximized.
• Every point may be generated by multiple distributions with some probability.
EM Algorithm
• Initialize the parameters randomly
• Let each parameter corresponds to a cluster center (mean)
• Iterate between two steps
– Expectation step: (probabilistically) assign points to
clusters

– Maximization step: estimate model parameters that


maximize the likelihood for the given assignment of
points
EM Algorithm
Expectation step: (probabilistically) assign points to clusters:
  compute Prob(point|mean), then
  Prob(mean|point) = Prob(mean) Prob(point|mean) / Prob(point)
Maximization step: estimate the model parameters that maximize the likelihood for the given assignment of points:
  each mean = weighted avg. of points, with weight = Prob(mean|point)
EM Algorithm
• Initialize cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters

– Maximization step: estimate model parameters


K-means Algorithm
• Goal: represent a data set in terms of K
clusters each of which is summarized by a
prototype
• Initialize prototypes, then iterate between two
phases:
– E-step: assign each data point to nearest prototype
– M-step: update prototypes to be the cluster means
Thank You
Foundations of Machine Learning

Module 9: Clustering
Part B: kmeans clustering

Sudeshna Sarkar
IIT Kharagpur
Partitioning Algorithms
• Given k,
• construct a partition of m objects {x₁, …, xₘ}, where each xᵢ is a vector in a real-valued space X ⊆ Rⁿ and n is the number of attributes,
• into a set of k clusters C₁, …, Cₖ.
• The cluster mean mⱼ serves as a prototype of the cluster Cⱼ.
• Find k clusters that optimize a chosen criterion
  – E.g., the within-cluster sum of squares (WCSS), the sum of squared distances of each point in a cluster to the cluster mean:
    WCSS = Σⱼ Σ_{x∈Cⱼ} ‖x − mⱼ‖²
• Heuristic method: k-means (MacQueen, 1967)
K-means algorithm
Given k
1. Randomly choose k data points (seeds) to be the initial
cluster centres
2. Assign each data point to the closest cluster centre
3. Re-compute the cluster centres using the current
cluster memberships.
4. If a convergence criterion is not met, go to 2.
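A minimal numpy sketch of these four steps (the seeding and convergence test are simple illustrative choices):

```python
# k-means: random seeds, assign to closest centre, recompute centres,
# stop when the centres no longer move.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # step 1: random seeds
    for _ in range(iters):
        # Step 2: assign each point to the closest cluster centre.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # Step 3: recompute centres from the current memberships.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # step 4: convergence check
            break
        centers = new
    return centers, labels
```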
Stopping/convergence criterion
OR
1. no re-assignments of data points to different
clusters
2. no (or minimum) change of centroids
3. minimum decrease in the sum of squared error
Kmeans illustrated
Convergence of K-Means
• Recomputation monotonically decreases each square error, since
  Σᵢ (xᵢ − a)² reaches its minimum for a = (1/nⱼ) Σᵢ xᵢ
  (where nⱼ is the number of members in cluster j).
• K-means typically converges quickly.
Time Complexity
• Computing distance between two items is
O(n) where n is the dimensionality of the
vectors.
• Reassigning clusters: O(km) distance
computations, or O(kmn).
• Computing centroids: Each item gets added
once to some centroid: O(mn).
• Assume these two steps are each done once
for t iterations: O(tknm).
Advantages
• Fast, robust, and easy to understand.
• Relatively efficient: O(tkmn)
• Gives the best results when the data set clusters are distinct or well separated from each other.
Disadvantages
• Requires a priori specification of the number of
cluster centers.
• Hard assignment of data points to clusters
• Euclidean distance measures can unequally
weight underlying factors.
• Applicable only when mean is defined i.e. fails
for categorical data.
• Only local optima
K-Means on an RGB image
• Each pixel xᵢ = {rᵢ, gᵢ, bᵢ} is a point in RGB space.
• The classifier (K-Means) assigns each pixel to a cluster: xᵢ → C(xᵢ).
• Cluster parameters: μ₁ for C₁, μ₂ for C₂, …, μₖ for Cₖ.
(Example from Bishop's book.)
Thank You