Presentation on ML
[Figure] Traditional programming: Data + Program → Computer → Output.
Machine learning: Data + Output → Computer → Program.
Machine Learning : Definition
• Learning is the ability to improve one's
behaviour based on experience.
• Build computer systems that automatically
improve with experience
• What are the fundamental laws that govern all
learning processes?
• Machine Learning explores algorithms that can
– learn from data / build a model from data
– use the model for prediction, decision making
or solving some tasks
Machine Learning : Definition
•A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E. [Mitchell]
Components of a learning problem
• Task: The behaviour or task being improved.
– For example: classification, acting in an
environment
• Data: The experiences that are being used to
improve performance in the task.
• Measure of improvement :
– For example: increased accuracy in prediction, acquisition of
new skills, improved speed and efficiency
Black-box Learner
[Figure] Experiences/Data and Background knowledge/Bias feed a Learner, which builds Models; a Reasoner uses the Models on the Problem/Task to produce an Answer/Performance.
Many domains and applications
Medicine:
• Diagnose a disease
– Input: symptoms, lab measurements, test
results, DNA tests, etc.
– Output: one of set of possible diseases, or
“none of the above”
• Data: historical medical records
• Learn: which future patients will respond best to
which treatments
Vision:
• say what objects appear in an image
• convert hand-written digits to characters 0..9
• detect where objects appear in an image
Robot control:
• Design autonomous mobile robots that learn
from experience to
– Play soccer
– Navigate from their own experience
NLP:
• detect where entities are mentioned in NL
• detect what facts are expressed in NL
• detect if a product/movie review is positive,
negative, or neutral
Financial:
• predict if a stock will rise or fall
• predict if a user will click on an ad or not
Speech recognition
Machine translation
Application in Business Intelligence
• Forecasting product sales quantities taking
seasonality and trend into account.
• Identifying cross selling promotional
opportunities for consumer goods.
•…
Some other applications
• Fraud detection : Credit card Providers
• Determine whether or not someone will
default on a home mortgage.
• Understand consumer sentiment based on
unstructured text data.
• Forecasting women’s conviction rates based
on external macroeconomic factors.
Broad types of machine learning
• Supervised Learning
– X,Y (pre-classified training examples)
– Given an observation x, what is the best label for y?
• Unsupervised learning
–X
– Given a set of x’s, cluster or summarize them
• Semi-supervised Learning
• Reinforcement Learning
– Determine what to do based on rewards and punishments.
Supervised Learning
Given:
– a set of input features X1, … , X𝑛
– A target feature Y
– a set of training examples where the values for the input
features and the target features are given for each example
– a new example, where only the values for the input features
are given
Predict the values for the target features for the new
example.
– classification when Y is discrete
– regression when Y is continuous
Classification
• Example: Credit scoring
Differentiating between low-risk and high-risk
customers from their income and savings
Regression
• Example: Price of a used car
x : car attributes
y = g (x, 𝜃 )
y : price
g ( ) model, 𝜃 parameters
Features
• Often, the individual observations are analyzed
into a set of quantifiable properties which are
called features. May be
– categorical (e.g. "A", "B", "AB" or "O", for
blood type)
– ordinal (e.g. "large", "medium" or "small")
– integer-valued (e.g. the number of words in a
text)
– real-valued (e.g. height)
Classification learning
• Task T:
– input: a set of instances d1,…,dn
• an instance has a set of features
• we can represent an instance as a vector d=<x1,…,xn>
– output: a set of predictions y1,..., yn
• one of a fixed set of constant values:
– {+1,-1} or {cancer, healthy}, or {rose, hibiscus, jasmine,
…}, or …
• Performance metric P:
• Experience E:
Classification Learning
Representations
1. Decision Tree
2. Linear function
3. Multivariate linear function
• Output: A hypothesis ℎ ∈ ℋ
Hypothesis Spaces
With N binary input features there are 2^(2^N) possible Boolean functions;
we cannot identify the target function
unless we see every possible input-output pair
Important issues in Machine Learning
Types of Regression Models
• Simple regression: 1 feature (linear or non-linear)
• Multiple regression: 2+ features (linear or non-linear)
Linear regression
• Given an input x, compute an output y
• For example:
– Predict height from age
– Predict house price from house area
– Predict distance from wall from sensors
[Figure: scatter plot of Y against X with a fitted line.]
Simple Linear Regression Equation
E(y) = β0 + β1·x
(regression line with intercept β0 and slope β1)
Linear Regression Model
Y = β0 + β1·X

Sample of 15 houses from the region:

House Number | Y: Actual Selling Price | X: House Size (100s ft2)
      1      |          89.5           |          20.0
      2      |          79.9           |          14.8
      3      |          83.1           |          20.5
      4      |          56.9           |          12.5
      5      |          66.6           |          18.0
      6      |          82.5           |          14.3
      7      |         126.3           |          27.5
      8      |          79.3           |          16.5
      9      |         119.9           |          24.3
     10      |          87.6           |          20.2
     11      |         112.6           |          22.0
     12      |         120.8           |          19.0
     13      |          78.5           |          12.3
     14      |          74.3           |          14.0
     15      |          74.8           |          16.7
  Averages   |          88.84          |          18.17
House price vs size
Linear Regression – Multiple Variables
Assumption
• The data may not form a perfect line.
• When we actually take a measurement (i.e., observe
the data), we observe:
Yi = β0 + β1·Xi + εi,
where εi is the random error associated with the i-th observation.
Assumptions about the Error
• Least-squares estimation: choose b0, b1 to minimize
Σ_i [ y_i − (b0 + b1·x_i) ]²
Multiple Linear Regression
Y = β0 + β1·X1 + β2·X2 + … + βn·Xn
Linear Regression
To learn the parameters
• Make h(x) close to y, for the available training
examples.
• Define a cost function
J(θ) = ½ Σ_i ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
• Find θ that minimizes J(θ).
LMS Algorithm
• Start a search algorithm (e.g., gradient descent)
with an initial guess of θ.
• Repeatedly update θ to make J(θ) smaller, until it converges to
a minimum:
θj := θj − α · ∂J(θ)/∂θj
• J is a convex quadratic function, so it has a single global minimum;
gradient descent eventually converges to the global minimum.
• At each iteration this algorithm takes a step in the direction of
steepest descent (the negative direction of the gradient).
LMS Update Rule
Repeat {
for i = 1 to m do
θj := θj + α ( y⁽ⁱ⁾ − hθ(x⁽ⁱ⁾) ) xj⁽ⁱ⁾   (for every j)
end for
} until convergence
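A minimal NumPy sketch of this update as batch gradient descent on J(θ); the data, learning rate alpha and epoch count are illustrative assumptions, not part of the slides:

import numpy as np

def lms_gradient_descent(X, y, alpha=0.01, epochs=5000):
    """Batch gradient descent on J(theta) = 1/2 * sum((X @ theta - y)^2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        error = X @ theta - y              # h_theta(x_i) - y_i for every example
        gradient = X.T @ error             # dJ/dtheta
        theta -= alpha * gradient / m      # step in the direction of steepest descent
    return theta

# Illustrative data: y = 1 + 2x plus noise; a column of ones models the intercept
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(0, 0.5, 50)
print(lms_gradient_descent(X, y))          # approximately [1., 2.]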
Delta Rule for Classification
[Figures: a regression line fit to 0/1 targets z plotted against x, before and after the data are shifted.]
• What would happen in this adjusted case for the perceptron and the delta rule, and
where would the decision point (i.e., the 0.5 crossing) be?
Delta Rule for Classification
[Figures: linear fits to the 0/1 targets.]
• What would happen if we were doing a regression fit with a sigmoid/logistic
curve rather than a line?
Delta Rule for Classification
[Figures: sigmoid curves fit to the same 0/1 data.]
• A sigmoid fits many decision cases quite well! This is basically what logistic
regression does.
Definition
• A decision tree is a classifier in the form of a
tree structure with two types of nodes:
– Decision node: Specifies a choice or test of
some attribute, with one branch for each
outcome
– Leaf node: Indicates classification of an example
Decision Tree Example 1
Whether to approve a loan
[Figure: decision tree — the root tests Employed?; if No, test Credit Score? (High/Low); if Yes, test Income? (High/Low); the leaves give the approve/reject decision.]
New Examples:
e7 ??? known new short work
e8 ??? unknown new short work
Possible splits
[Figure: candidate splits (e.g., on length), showing how many of the 9 skips and 9 reads examples fall on each No/Yes branch.]
Decision Tree for PlayTennis
[Figure: decision tree with root Outlook; the Sunny branch tests Humidity (High → No, Normal → Yes), Overcast → Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).]
Decision Tree
decision trees represent disjunctions of conjunctions
[Figure: the PlayTennis tree again.]
(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
Searching for a good tree
• How should you go about building a decision tree?
• The space of decision trees is too big for systematic
search.
• Stop and
– return a value for the target feature, or
– a distribution over target feature values
Principled Criterion
• Selection of an attribute to test at each node -
choosing the most useful attribute for classifying
examples.
• information gain
– measures how well a given attribute separates the training
examples according to their target classification
– This measure is used to select among the candidate
attributes at each step while growing the tree
– Gain is a measure of how much we can reduce
uncertainty (its value lies between 0 and 1)
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: optimal length code assigns (- log2p) bits
to message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S
• Entropy of S: average optimal number of bits to encode
information about certainty/uncertainty about S
Entropy(S) = p₊(−log₂ p₊) + p₋(−log₂ p₋) = −p₊ log₂ p₊ − p₋ log₂ p₋
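A small sketch of these quantities in Python, using the PlayTennis counts that appear on the next slides; the function names are illustrative:

import math

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            e -= p * math.log2(p)
    return e

def information_gain(parent, splits):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).

    parent: (pos, neg) counts for S; splits: list of (pos, neg) counts per value of A.
    """
    total = sum(parent)
    gain = entropy(*parent)
    for pos, neg in splits:
        gain -= (pos + neg) / total * entropy(pos, neg)
    return gain

print(round(entropy(9, 5), 3))                                 # 0.94 for S = [9+, 5-]
print(round(information_gain((9, 5), [(3, 4), (6, 1)]), 3))    # Humidity: ~0.152 (0.151 on the slides, which round intermediates)
print(round(information_gain((9, 5), [(6, 2), (3, 3)]), 3))    # Wind: 0.048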
Entropy
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
Humidity provides greater info. gain than Wind, w.r.t target classification.
Selecting the Next Attribute
S = [9+, 5−], Entropy(S) = 0.940
Gain(S, Outlook)
= 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971
= 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029
Note: 0·log₂0 = 0
ID3 Algorithm
[Figure: partially learned tree — Outlook at the root over [D1, D2, …, D14] = [9+, 5−]; the Sunny and Rain branches still need a test (marked ?), the Overcast branch is a Yes leaf.]
GINI index (node with c classes): GINI(t) = 1 − Σ_{j=1}^{c} [p(j|t)]²
GINI_split(A) = Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · GINI(S_v)
Splitting Based on Continuous Attributes
[Figure: a binary split “Taxable Income > 80K?” (Yes/No) and a multi-way split on ranges of Taxable Income (e.g., < 10K, …, > 80K).]
• Missing Values
• Costs of Classification
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees which
can represent all possible discrete functions.
Overfitting
• Learning a tree that classifies the training data
perfectly may not lead to the tree with the best
generalization performance.
– There may be noise in the training data
– May be based on insufficient data
• A hypothesis h is said to overfit the training
data if there is another hypothesis, h’, such that
h has smaller error than h’ on the training data
but h has larger error on the test data than h’.
[Figure: accuracy vs. tree complexity — accuracy on training data keeps increasing, while accuracy on test data peaks and then declines.]
Underfitting and Overfitting (Example)
Circular points:
0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points:
sqrt(x1² + x2²) > 1 or
sqrt(x1² + x2²) < 0.5
Underfitting and Overfitting
[Figure: training and test error vs. model complexity; the gap that opens up at high complexity is overfitting.]
Underfitting: when the model is too simple, both training and test errors are large
Overfitting due to Insufficient Examples
Lack of data points makes it difficult to predict correctly the class labels
of that region
Notes on Overfitting
• Overfitting results in decision trees that are more
complex than necessary
Pre-Pruning (Early Stopping)
• Evaluate splits before installing them:
– Don’t install splits that don’t look worthwhile
– when no worthwhile splits to install, done
Pre-Pruning (Early Stopping)
• Typical stopping conditions for a node:
– Stop if all instances belong to the same class
– Stop if all the attribute values are the same
• More restrictive conditions:
– Stop if number of instances is less than some user-specified
threshold
– Stop if the class distribution of instances is independent of the
available features (e.g., using a χ² test)
– Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
Reduced-error Pruning
• A post-pruning, cross validation approach
- Partition training data into “grow” set and “validation” set.
- Build a complete tree for the “grow” data
- Until accuracy on validation set decreases, do:
For each non-leaf node in the tree
Temporarily prune the tree below; replace it by majority vote
Test the accuracy of the hypothesis on the validation set
Permanently prune the node with the greatest increase
in accuracy on the validation set.
• Problem: Uses less data to construct the tree
• Sometimes done at the rules level
Triple Trade-Off
• There is a trade-off between three factors:
– the complexity of H, c(H),
– the training set size, N,
– the generalization error, E, on new data
• As N increases, E decreases
• As c(H) increases, E first decreases and then increases (overfitting)
• As c(H) increases, the training error decreases for some time
and then stays constant (frequently at 0)
Notes on Overfitting
• overfitting happens when a model is capturing
idiosyncrasies of the data rather than generalities.
– Often caused by too many parameters relative to the
amount of training data.
– E.g. an order-N polynomial can intersect any N+1 data
points
Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian
Regularization
• In a linear regression model overfitting is
characterized by large weights.
Penalize large weights in Linear Regression
• Introduce a penalty term in the loss function.
Regularized Regression
1. L2-regularization (ridge regression): add λ Σj wj² to the loss
2. L1-regularization (lasso): add λ Σj |wj| to the loss
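A minimal sketch of L2-regularized (ridge) regression via its closed-form solution; the data and the choice to penalize all weights (including any intercept column) are illustrative assumptions:

import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize ||X w - y||^2 + lam ||w||^2; closed form w = (X^T X + lam I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# With nearly duplicated (correlated) features, a larger lam shrinks the weights
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=30)     # nearly duplicated feature
y = X[:, 0] + rng.normal(scale=0.1, size=30)
for lam in (0.0, 0.1, 10.0):
    w = ridge_fit(X, y, lam)
    print(lam, round(float(np.linalg.norm(w)), 3))  # weight norm shrinks as lam grows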
Feature Reduction in ML
- The information about the target class is inherent
in the variables.
- Naïve view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
Curse of Dimensionality
• number of training examples is fixed
=> the classifier’s performance usually will
degrade for a large number of features!
Feature Selection
Problem of selecting some subset of features, while
ignoring the rest
Feature Extraction
• Project the original x_i, i = 1, …, d dimensions onto new
dimensions z_j, j = 1, …, k
Feature Selection Steps
Feature selection is an
optimization problem.
o Step 1: Search the space of
possible feature subsets.
o Step 2: Pick the subset
that is optimal or near-
optimal with respect to
some objective function.
Feature Selection Steps (cont’d)
Search strategies
– Optimum
– Heuristic
– Randomized
Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subset
• Supervised (wrapper method)
– Train using selected subset
– Estimate error on validation dataset
Signal to noise ratio
• Difference in the class means divided by the sum of the
standard deviations of the two classes
[Figure: PCA on image data x₁, x₂, …, x_p from N samples — reconstructions using p = 16, 32, 64, 100 components compared with the original image.]
Is PCA a good criterion for classification?
• Data variation
determines the
projection direction
• What’s missing?
– Class information
What is a good projection?
[Figure: two classes projected onto two directions — one projection makes the two classes overlap, the other keeps them separated.]
• Similarly, what is a good criterion?
– Separating the different classes
What class information may be useful?
• Between-class distance
– Distance between the centroids
of different classes
Between-class distance
What class information may be useful?
• Between-class distance
– Distance between the centroids of
different classes
• Within-class distance
• Accumulated distance of an instance
to the centroid of its class
Fisher criterion: J(w) = (m₁ − m₂)² / (s₁² + s₂²)
Multiple Classes
• For K classes, compute K−1 discriminants, i.e., project the
N-dimensional features onto a (K−1)-dimensional space.
Probability Basics
• Simple events: the individual outcomes of an experiment are called simple events.
• Event: an event is any collection of one or more simple events.
Sample Space
• Sample space : the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin three
times.
Random Variable
• A random variable is a function defined on the
sample space
– maps the outcome of a random event into real
scalar values
[Figure: a random variable X maps each outcome w in the sample space W to a real value X(w).]
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the values of two dice
• Example (coin tossing):
Σ_{i=0}^{50} Σ_{j=0}^{100} P(You get i heads AND your friend gets j heads) = 1
Conditional Probability
• P(X = x | Y = y) is the probability of X = x, given the
occurrence of Y = y
– e.g., you get 0 heads, given that your friend gets
3 heads
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
Law of Total Probability
• Given two discrete RVs X and Y, which take values in
{x₁, …, x_m} and {y₁, …, y_n}, we have
P(X = xᵢ) = Σⱼ P(X = xᵢ, Y = yⱼ)
         = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
Marginalization
P(X = xᵢ) = Σⱼ P(X = xᵢ, Y = yⱼ) = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)

Bayes rule:
P(X = xᵢ | Y = yⱼ) = P(X = xᵢ, Y = yⱼ) / P(Y = yⱼ)
                 = P(Y = yⱼ | X = xᵢ) P(X = xᵢ) / Σₖ P(Y = yⱼ | X = xₖ) P(X = xₖ)
Independent RVs
P(X = x, Y = y) = P(X = x) P(Y = y),  i.e., P(Y = y | X = x) = P(Y = y)
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
More on Conditional Independence
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
P(Y = y | X = x, Z = z) = P(Y = y | Z = z)
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf
– f(x) ≥ 0, ∀x
– ∫ f(x) dx = 1
– f(x) ≤ 1 ???
• Actual probability can be obtained by taking
the integral of the pdf
– e.g., the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• F_X(v) = P(X ≤ v)
• Discrete RVs
– F_X(v) = Σ_{vᵢ ≤ v} P(X = vᵢ)
• Continuous RVs
– F_X(v) = ∫_{−∞}^{v} f(x) dx
– d/dx F_X(x) = f(x)
Common Distributions
• Normal
f(x) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) ),  −∞ < x < ∞
– e.g., the height of the entire population
[Figure: bell-shaped density f(x) plotted for x from −5 to 5.]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
Covariance Matrix
• 1
f X x1 , , xd d 2 12
2
1 T 1
exp x x
2
Mean
Mean and Variance
• Mean (Expectation): μ = E[X]
– Discrete RVs: E[X] = Σ_{vᵢ} vᵢ P(X = vᵢ)
– Continuous RVs: E[X] = ∫ x f(x) dx
• Variance: V(X) = E[(X − μ)²]
– Discrete RVs: V(X) = Σ_{vᵢ} (vᵢ − μ)² P(X = vᵢ)
– Continuous RVs: V(X) = ∫ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the mean of the distribution
by the sample average:
μ̂ = (1/N) Σᵢ xᵢ
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• Based on prior estimate of its probability
• and New relevant evidence
Bayes Theorem
Goal: To determine the most probable hypothesis, given the data D plus any
initial knowledge about the prior probabilities of the various hypotheses in H.
Bayes rule: P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior density )
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
The MAP hypothesis:
h_MAP = argmax_{h∈H} P(D | h) P(h) / P(D)
     = argmax_{h∈H} P(D | h) P(h)
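A worked version of this computation in Python, using only the numbers given on the slide (variable names are illustrative):

# MAP computation for the lab-test example above
p_cancer = 0.008
p_pos_given_cancer = 0.98          # correct positive rate
p_neg_given_no_cancer = 0.97       # correct negative rate -> P(+ | no cancer) = 0.03

joint_cancer = p_pos_given_cancer * p_cancer                      # P(+|cancer)P(cancer)  ~ 0.0078
joint_no_cancer = (1 - p_neg_given_no_cancer) * (1 - p_cancer)    # P(+|~cancer)P(~cancer) ~ 0.0298

# h_MAP = argmax_h P(D|h) P(h): the "no cancer" hypothesis wins despite the positive test
print(joint_cancer, joint_no_cancer)
print("P(cancer | +) =", joint_cancer / (joint_cancer + joint_no_cancer))   # ~0.21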
Maximum Likelihood (ML) Hypothesis
h_MAP = argmax_{h∈H} P(h | D)
     = argmax_{h∈H} P(D | h) P(h) / P(D)
     = argmax_{h∈H} P(D | h) P(h)
If every hypothesis in H is equally probable a priori, this reduces to the
maximum likelihood (ML) hypothesis:
h_ML = argmax_{h∈H} P(D | h)
Comments:
Computational intensive
Providing a standard for judging the performance of
learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
Maximum likelihood and least-squared error
• Learn a Real-Valued Function:
• Consider any real-valued target function f.
• Training examples (xi,di) are assumed to have Normally
distributed noise ei with zero mean and variance σ2, added
to the true target value f(xi),
so dᵢ satisfies dᵢ = f(xᵢ) + eᵢ.
Assume that eᵢ is drawn independently for each xᵢ.
Compute ML Hypo
h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ −½ ln(2πσ²) − (dᵢ − h(xᵢ))² / (2σ²) ]
    = argmin_{h∈H} Σ_{i=1}^{m} (dᵢ − h(xᵢ))²
Bayes Optimal Classifier
Question: Given new instance x, what is its most probable classification?
• h_MAP(x) is not necessarily the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3
Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = -
What is the most probable classification of x ?
Bayes optimal classification:
argmax_{vⱼ∈V} Σ_{hᵢ∈H} P(vⱼ | hᵢ) P(hᵢ | D)
where V is the set of all the values a classification can take and vⱼ is one
possible such classification.
Example:
P(h₁|D) = .4, P(−|h₁) = 0, P(+|h₁) = 1
P(h₂|D) = .3, P(−|h₂) = 1, P(+|h₂) = 0
P(h₃|D) = .3, P(−|h₃) = 1, P(+|h₃) = 0
Therefore Σ_{hᵢ∈H} P(+|hᵢ) P(hᵢ|D) = .4 and Σ_{hᵢ∈H} P(−|hᵢ) P(hᵢ|D) = .6,
so the Bayes optimal classification of x is −.
Gibbs Algorithm
• Bayes optimal classifier is quite computationally
expensive, if H contains a large number of
hypotheses.
• An alternative, less optimal classifier Gibbs algorithm,
defined as follows:
1. Choose a hypothesis h randomly according to the posterior
probability distribution P(h|D) over H.
2. Use it to classify new instance
Naïve Bayes
• Bayes classification
P(Y | X) ∝ P(X | Y) P(Y) = P(X₁, …, Xₙ | Y) P(Y)
Difficulty: learning the joint probability P(X₁, …, Xₙ | Y)
• Naïve Bayes classification
Assume all input features are conditionally independent given Y:
P(X₁, X₂, …, Xₙ | Y) = P(X₁ | X₂, …, Xₙ, Y) P(X₂, …, Xₙ | Y)
                   = P(X₁ | Y) P(X₂, …, Xₙ | Y)
                   = P(X₁ | Y) P(X₂ | Y) ⋯ P(Xₙ | Y)
Naïve Bayes
Bayes rule: P(Y | X₁, …, Xₙ) ∝ P(X₁, …, Xₙ | Y) P(Y)
• Classify (X^new): y* = argmax_y P(y) Πᵢ P(Xᵢ^new | y)
* probabilities must sum to 1, so we need to estimate only n−1 parameters...
Estimating Parameters: Y, Xᵢ discrete-valued
Example
Learning Phase (conditional probability tables):

Outlook  | Play=Yes | Play=No
Sunny    |   2/9    |  3/5
Overcast |   4/9    |  0/5
Rain     |   3/9    |  2/5

Temperature | Play=Yes | Play=No
Hot         |   2/9    |  2/5
Mild        |   4/9    |  2/5
Cool        |   3/9    |  1/5

Humidity | Play=Yes | Play=No
High     |   3/9    |  4/5
Normal   |   6/9    |  1/5

Wind   | Play=Yes | Play=No
Strong |   3/9    |  3/5
Weak   |   6/9    |  2/5
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=No) = 3/5
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=No) = 1/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9
P(Play=No) = 5/14
P(Play=Yes) = 9/14
– Decision making with the MAP rule
P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No) = 0.0206
Since P(No|x’) > P(Yes|x’), we label x’ as No.
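The same MAP computation written out in Python, using only the table entries above:

# Reproducing the PlayTennis decision for x' = (Sunny, Cool, High, Strong)
p_yes, p_no = 9 / 14, 5 / 14
likelihood_yes = (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)   # P(Sunny,Cool,High,Strong | Yes)
likelihood_no = (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)    # P(Sunny,Cool,High,Strong | No)

score_yes = likelihood_yes * p_yes
score_no = likelihood_no * p_no
print(round(score_yes, 4), round(score_no, 4))           # 0.0053 0.0206 -> predict Play = No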
MAP estimates:
The only difference from the ML estimates is that
“imaginary” (prior) examples are added to each count.
Naïve Bayes: Assumptions of Conditional
Independence
Often the Xi are not really conditionally independent
• Classify (Xnew)
Estimating Parameters: Y discrete, Xi continuous
P̂(x | Yes) = (1 / (2.35 √(2π))) exp( −(x − 21.64)² / (2 · 2.35²) )
P̂(x | No)  = (1 / (7.09 √(2π))) exp( −(x − 23.88)² / (2 · 7.09²) )
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• Rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Bayesian Networks
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
– But all variables are rarely completely independent.
• Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
[Figure: an example network over the variables Late wakeup, Rainy day, Accident, Traffic Jam, Meeting postponed, Late for Work, and Late for meeting.]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = { X1,…….Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief
Networks
[Figure: the same example network — Late wakeup, Accident, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting.]
A conditional probability table associated with each node specifies the conditional distribution of the variable given its immediate parents in the graph.
Applications of Bayesian Networks
• Diagnosis: P(cause|symptom)=?
• Prediction: P(symptom|cause)=?
• Classification: P(class|data)
• Decision-making (given a cost function)
[Figure: a small network with causes C1, C2 and their symptoms, illustrating the cause/symptom directions used above.]
Bayesian Networks
• Structure of the graph ⇒ conditional independence relations
In general,
p(X₁, X₂, …, X_N) = Πᵢ p(Xᵢ | parents(Xᵢ))
Conditionally independent effects (A: disease D; B, C: symptoms S1, S2):
p(A, B, C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
Markov dependence (A → B → C):
p(A, B, C) = p(C|B) p(B|A) p(A)
Naïve Bayes Model
[Figure: class node C with children Y₁, Y₂, Y₃, …, Yₙ.]
Hidden Markov Model (HMM)
[Figure: observed variables Y₁, Y₂, …, Yₙ emitted by a hidden state chain S₁ → S₂ → … → Sₙ.]
Assumptions:
1. hidden state sequence is Markov
2. observation Yt is conditionally independent of all other
variables given St
Maximize likelihood
• How do we maximize the likelihood? Gradient ascent
– Updates: θⱼ := θⱼ + α ( y − hθ(x) ) xⱼ
Assume one training example (x,y), and take derivatives to
derive the stochastic gradient ascent rule.
Introduction to Support Vector
Machine
Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.
Logistic Regression and Confidence
• Logistic Regression: P(y = 1 | x) = 1 / (1 + exp(−(w·x + b)))
Preventing overfitting with many features
• Suppose a big set of features.
• What is the best separating
line to use?
• Bayesian answer:
– Use all
– Weight each line by its
posterior probability
• Can we approximate the correct
answer efficiently?
Support Vectors
• The line that maximizes the
minimum margin.
• This maximum-margin separator is
determined by a subset of the
datapoints.
– called “support vectors”.
– we use the support vectors to decide which
side of the separator a test case is on.
[Figure: the support vectors are indicated by the circles around them.]
Functional Margin
• Functional margin of a point (xᵢ, yᵢ) w.r.t. the hyperplane (w, b):
γ̂ᵢ = yᵢ (wᵀxᵢ + b)
– It reflects how far the point is from the decision boundary, and on which side.
Geometric Margin
• For a decision surface wᵀx + b = 0, the vector orthogonal to it is w.
[Figure: points P = (a1, a2) and Q = (b1, b2) and their projections onto the direction of w.]
Geometric margin of a point (xᵢ, yᵢ): γᵢ = yᵢ (wᵀxᵢ + b) / ‖w‖
Geometric margin of (w, b) w.r.t. S =
the smallest of the geometric margins of the individual points.
Maximize margin width
[Figure: two classes (denoted +1 and −1) in the (x1, x2) plane, separated by a line with the margin marked.]
• Assume linearly separable training examples.
• The classifier with the maximum margin width is
robust to outliers and thus has strong generalization ability.
Maximize Margin Width
• Maximize the margin 2/‖w‖ subject to
yᵢ (wᵀxᵢ + b) ≥ 1 for all i
• Scale (w, b) so that the closest points satisfy yᵢ (wᵀxᵢ + b) = 1
• Maximizing 2/‖w‖ is the same as minimizing ‖w‖²/2
• Minimize ‖w‖²/2 subject to the constraints, for all (xᵢ, yᵢ):
wᵀxᵢ + b ≥ +1 if yᵢ = +1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
Large Margin Linear Classifier
• Formulation:
minimize ½ ‖w‖²
such that yᵢ (wᵀxᵢ + b) ≥ 1 for all i
[Figure: the margin between the hyperplanes wᵀx + b = +1 and wᵀx + b = −1, with the separating hyperplane wᵀx + b = 0 in the middle; x⁺ and x⁻ denote points of the +1 and −1 classes.]
Solving the Optimization Problem
minimize ½ ‖w‖²
s.t. yᵢ (wᵀxᵢ + b) ≥ 1
Non-linear decision surface
• We saw how to deal with datasets which are linearly
separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map data to a high dimensional space where it
is linearly separable.
– Won’t using a bigger set of features make the computation slow?
– The “kernel” trick makes the computation fast.
Non-linear SVMs: Feature Space
Φ : 𝑥 → 𝜙(𝑥 )
Nonlinear SVMs: The Kernel Trick
• With this mapping, the discriminant function works on φ(x); for the degree-2 polynomial kernel the mapping never has to be computed explicitly:
K(xᵢ, xⱼ) = (1 + xᵢ·xⱼ)²
         = 1 + xᵢ₁²xⱼ₁² + 2 xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂
         = [1, xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂] · [1, xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂]
         = φ(xᵢ)·φ(xⱼ),
where φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]
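A quick numerical check of this identity in Python; the two sample points are arbitrary:

import numpy as np

def poly2_kernel(xi, xj):
    """K(xi, xj) = (1 + xi . xj)^2 for 2-D inputs."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map whose inner product equals the kernel above."""
    x1, x2 = x
    return np.array([1.0, x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(xi, xj))       # 4.0
print(phi(xi) @ phi(xj))          # 4.0 -- same value, without ever forming phi in the kernelized algorithm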
Commonly-used kernel functions
• Linear kernel: K(xᵢ, xⱼ) = xᵢ·xⱼ
• Polynomial of power p: K(xᵢ, xⱼ) = (1 + xᵢ·xⱼ)ᵖ
• Sigmoid: K(xᵢ, xⱼ) = tanh(β₀ xᵢ·xⱼ + β₁)
SVM examples
(dual constraints:)
such that 0 ≤ αᵢ ≤ C and Σᵢ₌₁ⁿ αᵢ yᵢ = 0
Neural Unit
ANNs
• ANNs incorporate the two fundamental components of
biological neural nets:
1. Nodes - Neurones
2. Weights - Synapses
Perceptrons
• Basic unit in a neural network: Linear separator
– N inputs, x1 ... xn
– Weights for each input, w1 ... wn
– A bias input x0 (constant) and associated weight w0
– Weighted sum of inputs,
– A threshold function: output 1 if y > 0, output 0 if y ≤ 0
[Figure: inputs x₀, x₁, …, xₙ with weights w₀, w₁, …, wₙ feed a summing unit y = Σᵢ wᵢxᵢ, followed by the threshold function φ.]
Perceptron training rule
Updates perceptron weights for a training ex as follows:
wᵢ ← wᵢ + η (t − o) xᵢ
where t is the target output, o is the perceptron output for w·x = Σᵢ wᵢxᵢ, and η is the learning rate.
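A minimal NumPy sketch of this rule; the OR data, learning rate and epoch count are illustrative assumptions:

import numpy as np

def perceptron_train(X, targets, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i (bias folded in as x_0 = 1)."""
    X = np.column_stack([np.ones(len(X)), X])      # prepend the bias input x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, targets):
            o = 1 if w @ x > 0 else 0              # threshold unit output
            w += eta * (t - o) * x                 # no change when the prediction is correct
    return w

# Linearly separable example: Boolean OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
w = perceptron_train(X, t)
print([1 if w @ np.r_[1, x] > 0 else 0 for x in X])   # [0, 1, 1, 1]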
Neuron Model: Logistic Unit
σ(z) = 1 / (1 + e^(−z)) = 1 / (1 + e^(−w·x))
Training rule: wᵢ ← wᵢ + η (t − o) o (1 − o) xᵢ
Multi-layer Neural Network
Limitations of Perceptrons
• Perceptrons have a monotonicity property:
If a link has positive weight, activation can only increase as the
corresponding input value increases (irrespective of other
input values)
• Can’t represent functions where input interactions can cancel
one another’s effect (e.g. XOR)
• Can represent only linearly separable functions
A solution: multiple layers
[Figure: a network with input layer (x₁, x₂), hidden layer (z₁, z₂) and output layer (y), together with the decision regions it produces in the (x₁, x₂) plane.]
Power/Expressiveness of Multilayer
Networks
• Can represent interactions among inputs
• Two layer networks can represent any Boolean
function, and continuous functions (within a
tolerance) as long as the number of hidden units is
sufficient and appropriate activation functions used
• Learning algorithms exist, but weaker guarantees
than perceptron learning algorithms
Multilayer Network
[Figure: inputs feed a first hidden layer, then a second hidden layer, then the output layer, which produces the outputs.]
Two-layer back-propagation neural network
[Figure: input signals x₁, …, xₙ enter the input layer, pass through hidden units (weights wᵢⱼ) and output units (weights wⱼₖ) to produce y₁, …, y_{n2}; error signals propagate backwards from the output layer.]
The back-propagation training algorithm
• Step 1: Initialisation
Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small range
[Figure: a small network with inputs x = (x₁, x₂), a hidden layer z (weights v₀₁, v₁₁, v₂₁, v₀₂, v₁₂, v₂₂), and an output y (weights w₀₁, w₁₁, w₂₁).]
Backprop
• Initialization
– Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small
range
• Forward computing:
– Apply an input vector x to the input units
– Compute the activation/output vector z of the hidden layer:
z = f(Vᵀx)
– Compute the output vector y of the output layer:
y = f(Wᵀz)
y is the result of the computation.
Learning for BP Nets
• Update of weights in W (between output and hidden layers):
– delta rule
• Not applicable to updating V (between input and hidden)
– don’t know the target values for hidden units z₁, z₂, …, z_p
• Solution: Propagate errors at output units to hidden units to
drive the update of weights in V (again by delta rule)
(error BACKPROPAGATION learning)
• Error backpropagation can be continued downward if the net
has more than one hidden layer.
• How to compute errors on hidden units?
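Before the derivation, a minimal NumPy sketch of these updates for a one-hidden-layer net under squared error and sigmoid units; the XOR data, shapes, learning rate and iteration count are illustrative assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, V, W, eta=0.5):
    """One stochastic backprop update; V: input->hidden, W: hidden(+bias)->output, loss = 1/2 (t - y)^2."""
    z = sigmoid(V @ x)                                    # hidden activations
    z_b = np.append(z, 1.0)                               # constant unit acts as the output bias
    y = sigmoid(W @ z_b)                                  # output activation
    delta_out = (t - y) * y * (1 - y)                     # delta rule at the output unit
    delta_hid = (W[:, :-1].T @ delta_out) * z * (1 - z)   # errors propagated back to the hidden units
    W += eta * np.outer(delta_out, z_b)
    V += eta * np.outer(delta_hid, x)
    return V, W

# Illustrative run on XOR (inputs carry a bias component x0 = 1)
rng = np.random.default_rng(1)
V = rng.normal(scale=0.5, size=(2, 3))                    # 2 hidden units, inputs (x0, x1, x2)
W = rng.normal(scale=0.5, size=(1, 3))                    # 1 output unit, inputs (z1, z2, bias)
data = [((1, 0, 0), 0.0), ((1, 0, 1), 1.0), ((1, 1, 0), 1.0), ((1, 1, 1), 0.0)]
for _ in range(20000):
    for x, t in data:
        V, W = backprop_step(np.array(x, float), np.array([t]), V, W)
outs = [float(sigmoid(W @ np.append(sigmoid(V @ np.array(x, float)), 1.0))[0]) for x, _ in data]
print([round(o, 2) for o in outs])   # expected to approach [0, 1, 1, 0]; a poor local minimum is possible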
Derivation
• For one output neuron, the error function is E = ½ (t − y)², with y = σ(w·z).
[Figure: the XOR problem — points of class I (y = 1) and class II (y = −1) arranged in the (x₁, x₂) plane so that no single line separates them.]
Boolean OR
x1 | x2 | output
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   1
[A single unit with weights w0 = −0.5, w1 = 1, w2 = 1 over inputs x1, x2 implements OR.]
Boolean AND
x1 | x2 | output
 0 |  0 |   0
 0 |  1 |   0
 1 |  0 |   0
 1 |  1 |   1
[A single unit with weights w0 = −1.5, w1 = 1, w2 = 1 over inputs x1, x2 implements AND.]
Boolean XOR
x1 | x2 | output
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   0
[No single linear unit can implement XOR.]
Boolean XOR (two-layer network)
x1 | x2 | output
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   0
[Network: an OR hidden unit (bias −0.5) and an AND hidden unit (bias −1.5), both fed by x1 and x2 with weights 1; the output unit has bias −0.5 and weights +1 from the OR unit and −1 from the AND unit.]
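A small Python check of this construction, using exactly the weights listed above:

def step(a):
    """Threshold unit: fire iff the weighted sum exceeds 0."""
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    """XOR built from the OR and AND units above: XOR = OR AND (NOT AND)."""
    h_or = step(-0.5 + 1 * x1 + 1 * x2)       # OR unit  (w0 = -0.5, w1 = w2 = 1)
    h_and = step(-1.5 + 1 * x1 + 1 * x2)      # AND unit (w0 = -1.5, w1 = w2 = 1)
    return step(-0.5 + 1 * h_or - 1 * h_and)  # output unit (w0 = -0.5, weights +1 and -1)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]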
Representation Capability of NNs
• Single layer nets have limited representation power (linear
separability problem). Multi-layer nets (or nets with non-
linear hidden units) may overcome linear inseparability
problem.
• Every Boolean function can be represented by a network
with a single hidden layer.
• Every bounded continuous function can be approximated
with arbitrarily small error, by network with one hidden layer
• Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.
Derivation
• For one output neuron, the error function is E = ½ (t − y)², with y = σ(w·z), where σ is the sigmoid.
Backpropagation
• Gradient descent over entire network weight vector
• Can be generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
• May include weight momentum
Stopping
1. Fixed maximum number of epochs: most naïve
2. Keep track of the training and validation error
curves.
Overfitting in ANNs
Local Minima
[Figure: Input → Trainable Feature Extractor → … → Trainable Feature Extractor → Trainable Classifier → Output.]
Unsupervised Pre-training
We will use greedy, layer wise pre-training
Train one layer at a time
Fix the parameters of previous hidden layers
Previous layers viewed as feature extraction
find hidden unit features that are more common in training
input than in random inputs
Tuning the Classifier
• After pre-training of the layers
– Add output layer
– Train the whole network using
supervised learning (Back propagation)
Deep neural network
• Feed forward NN
• Stacked Autoencoders (multilayer neural net
with target output = input)
• Stacked restricted Boltzmann machine
• Convolutional Neural Network
A Deep Architecture: Multi-Layer
Perceptron
Output layer y — here predicting a supervised target
Hidden layers h1, h2, h3, … — these learn more abstract representations as you head up
Input layer x — raw sensory inputs
A Neural Network
• Training : Back
Propagation of Error
– Calculate the total error at the top
– Calculate contributions to the error at each step going backwards
– The weights are modified as the error is propagated
[Figure: input layer → hidden layer → output layer.]
Training Deep Networks
• Difficulties of supervised training of deep networks
1. Early layers of MLP do not get trained well
• Diffusion of Gradient – error attenuates as it propagates to
earlier layers
• Leads to very slow training
• the error to earlier layers drops quickly as the top layers "mostly"
solve the task
2. Often not enough labeled data available while there may be
lots of unlabeled data
3. Deep networks tend to have more local minima problems
than shallow networks during supervised training
Training of neural networks
• Forward Propagation :
– Sum inputs, produce
activation
– feed-forward
• sigmoid(x) = 1 / (1 + e^(−x))
• Rectified linear: relu(x) = max(0, x)
– Simplifies backprop
– Makes learning faster
– Makes features sparse
→ Preferred option
Autoencoder
• Unlabeled training examples {x⁽¹⁾, x⁽²⁾, …}
• Set the target values to be equal to the inputs: y⁽ⁱ⁾ = x⁽ⁱ⁾
• The network is trained to output the input (learn the identity function).
[Figure: the inputs feed a hidden layer of activations a₁, a₂, a₃, which reconstructs the input at the output layer.]
Stacked Auto-Encoders
• Do supervised training on the last layer using final
features
• Then do supervised training on the entire network to fine-
tune all weights
Softmax output layer:
yᵢ = e^{zᵢ} / Σⱼ e^{zⱼ}
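A minimal NumPy sketch of this softmax output; the max-shift is a standard numerical-stability choice, not part of the slide:

import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j); shifting by max(z) keeps the exponentials stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; the largest z gets the largest probability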
Convolutional Neural Networks
• A CNN consists of a number of convolutional and
subsampling layers.
• Input to a convolutional layer is a m x m x r image
where m x m is the height and width of the image and r is
the number of channels, e.g. an RGB image has r=3
• Convolutional layer will have k filters (or kernels)
• size n x n x q
• n is smaller than the dimension of the image and,
• q can either be the same as the number of channels r or
smaller and may vary for each kernel
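A minimal NumPy sketch of a single-channel "valid" convolution, just to show where the (m − n + 1) output size comes from; the toy image and kernel are illustrative assumptions:

import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of an m x m single-channel image with an n x n kernel.

    The output is (m - n + 1) x (m - n + 1), matching the layer description above.
    """
    m = image.shape[0]
    n = kernel.shape[0]
    out = np.zeros((m - n + 1, m - n + 1))
    for i in range(m - n + 1):
        for j in range(m - n + 1):
            out[i, j] = np.sum(image[i:i + n, j:j + n] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)      # a toy 6 x 6 "image"
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])    # responds to horizontal intensity changes
print(conv2d_valid(image, edge_kernel).shape)         # (5, 5)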
Goal of Learning Theory
• Two core aspects of ML
– Algorithm Design. How to optimize?
– Confidence for rule effectiveness on future data.
• We need particular settings (models)
– Probably Approximately Correct (PAC)
[Figure: the target concept C and a hypothesis h as regions of the instance space; their symmetric difference is the error region.]
Prototypical Concept Learning Task
• Given
– Instances X
– Distribution D over X
– Target function c
– Hypothesis space H
– Training examples S, drawn i.i.d. from D and labeled by c
[Figure: instance space X with positive (+) and negative (−) examples of the target concept c and a hypothesis h.]
• Determine
– A hypothesis h ∈ H such that h(x) = c(x) for all x in S?
– A hypothesis h ∈ H such that h(x) = c(x) for all x in X?
• An algorithm does optimization over S, finds hypothesis h.
• Goal: Find h which has small error over D.
Computational Learning Theory
• Can we be certain about how the learning algorithm
generalizes?
• We would have to see all the examples.
• Inductive inference –
[Figure: instance space X with the target concept c and a hypothesis h.]
• Inductive inference:
generalizing beyond the training
data is impossible unless we add
more assumptions (e.g.,
priors over H)
We need a bias!
Function Approximation
• How many labeled examples in order to determine which of the
hypothesis is the correct one?
• All instances in X must be labeled!
• Inductive inference: generalizing beyond the training data is
impossible unless we add more assumptions (e.g., bias)
[Figure: instance space X with target concept c and two hypotheses h₁, h₂ that agree on the labeled examples but disagree elsewhere; |H| is large.]
Error of a hypothesis
The true error of hypothesis h, with respect to the target concept c and observation distribution D, is the probability that h will misclassify an instance drawn according to D:
error_D(h) = Pr_{x∼D}[ h(x) ≠ c(x) ]

m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that, with probability ≥ 1 − δ, every h ∈ H consistent with the data has error_D(h) ≤ ε. The bound is
• inversely linear in ε
• logarithmic in |H|
• ε, the error parameter: D might place low weight on certain parts of the space
• δ, the confidence parameter: there is a small chance the examples we get are not representative of the distribution
Sample Complexity for Supervised Learning
Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that, with probability ≥ 1 − δ, every h ∈ H consistent with the data has error_D(h) ≤ ε.
Proof sketch: Assume k bad hypotheses H_bad = {h₁, …, h_k} with error_D(hᵢ) ≥ ε; the probability that some bad hypothesis is consistent with m i.i.d. examples is at most |H|(1 − ε)^m ≤ |H| e^(−εm) ≤ δ.
Consistent Case
Theorem
Inconsistent Case
What if there is no perfect h?
Theorem: After m examples, with probability ≥ 1 − δ, all h ∈ H have
|error_D(h) − error_S(h)| ≤ √( (ln|H| + ln(2/δ)) / (2m) )
Sample complexity: example
• H: conjunctions of n Boolean literals. Is it PAC-learnable?
Yes: |H| = 3ⁿ, so m ≥ (1/ε)(n·ln 3 + ln(1/δ)) examples suffice.
• Concrete examples:
– δ=ε=0.05, n=10 gives 280 examples
– δ=0.01, ε=0.05, n=10 gives 312 examples
– δ=ε=0.01, n=10 gives 1,560 examples
– δ=ε=0.01, n=50 gives 5,954 examples
• Result holds for any consistent learner, such as FindS.
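The concrete numbers above can be checked directly from the bound; a short Python sketch (function name illustrative):

from math import ceil, log

def conjunction_sample_complexity(n, eps, delta):
    """m >= (1/eps) * (n*ln(3) + ln(1/delta)) for conjunctions of n Boolean literals (|H| = 3^n)."""
    return ceil((n * log(3) + log(1 / delta)) / eps)

print(conjunction_sample_complexity(10, 0.05, 0.05))   # 280
print(conjunction_sample_complexity(10, 0.05, 0.01))   # 312
print(conjunction_sample_complexity(10, 0.01, 0.01))   # 1560
print(conjunction_sample_complexity(50, 0.01, 0.01))   # 5954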
Sample Complexity of Learning
Arbitrary Boolean Functions
• Consider any Boolean function over n Boolean features,
such as the hypothesis space of DNF formulas or decision trees.
There are 2^(2^n) of these, so a sufficient number of examples
to learn a PAC concept is
m ≥ (1/ε)( 2ⁿ ln 2 + ln(1/δ) )
Thank You
Concept Learning Task
“Days in which Aldo enjoys swimming”
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
Shattering Instances (cont)
But 3 instances cannot be shattered by a single interval.
[For three points x < y < z on the real line, the possible positive/negative labelings are: none positive; x; y; z; x,y; y,z; x,y,z; and x,z positive with y negative. A single interval can realize every labeling except the last one — it cannot contain x and z while excluding y — so the three points are not shattered.]
VC Dimension
• The Vapnik-Chervonenkis dimension, VC(H). of hypothesis space H
defined over instance space X is the size of the largest finite subset
of X shattered by H. If arbitrarily large finite subsets of X can be
shattered then VC(H) = ∞
• For a single intervals on the real line, all sets of 2 instances can be
shattered, but no set of 3 instances can, so VC(H) = 2.
VC Dimension
• An unbiased hypothesis space shatters the entire instance
space.
• The larger the subset of X that can be shattered, the more
expressive (and less biased) the hypothesis space is.
• The VC dimension of the set of oriented lines in 2-d is
three.
VC Dimension Example
Consider axis-parallel rectangles in the real-plane,
i.e. conjunctions of intervals on two real-valued
features. Some 4 instances can be shattered.
• Therefore VC(H) = 4
• Generalizes to axis-parallel hyper-rectangles (conjunctions of
intervals in n dimensions): VC(H)=2n.
Upper Bound on Sample Complexity with VC
m ≥ (1/ε)( 4 log₂(2/δ) + 8 · VC(H) · log₂(13/ε) )
Sample Complexity Lower Bound with VC
• There is also a general lower bound on the minimum number of
examples necessary for PAC learning (Ehrenfeucht, et al., 1989):
Consider any concept class C such that ,any learner and any
Then there exists a distribution D and target concept in C such that
if L observes fewer than:
max[ (1/ε) log₂(1/δ), (VC(C) − 1) / (32ε) ]
examples, then with probability at least δ, L outputs a hypothesis
having error greater than ε.
• Ignoring constant factors, this lower bound is the same as the
upper bound except for the extra log2(1/ ε) factor in the upper
bound.
Foundations of Machine Learning
Sudeshna Sarkar
IIT Kharagpur
What is Ensemble Classification?
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
Why should it work?
• Works well only if the individual classifiers
disagree
– Error rate < 0.5 and errors are independent
– Error rate is highly correlated with the correlations
of the errors made by the different learners
Bias vs. Variance
• We would like low bias error and low variance error
• Ensembles using multiple trained (high variance/low
bias) models can average out the variance, leaving
just the bias
– Less worry about overfit (stopping criteria, etc.)
with the base models
Combining Weak Learners
• Combining weak learners
– Assume n independent models, each having accuracy of 70%.
– If all n give the same class output then you can be confident it
is correct with probability 1 − (1 − 0.7)^n.
– Normally not completely independent, but unlikely that all n
would give the same output
• Accuracy better than the base accuracy of the models by using the
majority output.
– If n1 models say class 1 and n2<n1 models say class 2, then
P(class1) = 1 – Binomial(n, n2, .7)
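A small sketch of how majority-vote accuracy behaves under the (idealized) independence assumption above; the chosen values of n are illustrative:

from math import comb

def majority_vote_accuracy(n, p):
    """P(a strict majority of n independent base classifiers is correct), each correct with prob. p."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 5, 11, 25):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
# The ensemble accuracy climbs well above the 0.7 base rate as independent voters are added.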
Ensemble Creation Approaches
• Get less correlated errors between models
– Injecting randomness
• initial weights (eg, NN), different learning parameters,
different splits (eg, DT) etc.
– Different Training sets
• Bagging, Boosting, different features, etc.
– Forcing differences
• different objective functions
– Different machine learning model
Ensemble Combining Approaches
• Unweighted Voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting),
Expertise, etc.
• Stacking - Learn the combination function
Combine Learners: Voting
• Unweighted voting
• Linear combination
(weighted vote)
• weight ∝ accuracy
• Bayesian: weight ∝ model posterior probability
Fixed Combination Rules
Bayes Optimal Classifier
• The Bayes Optimal Classifier is an ensemble of all the
hypotheses in the hypothesis space.
• On average, no other ensemble can outperform it.
• The vote for each hypothesis
– proportional to the likelihood that the training dataset
would be sampled from a system if that hypothesis were
true.
– is multiplied by the prior probability of that hypothesis.
• y is the predicted class,
• C is the set of all possible classes,
• H is the hypothesis space,
• T is the training data.
The Bayes Optimal Classifier represents a hypothesis
that is not necessarily in H.
But it is the optimal hypothesis in the ensemble space.
Practicality of Bayes Optimal Classifier
• Cannot be practically implemented.
• Most hypothesis spaces are too large
• Many hypotheses output a class or a value, and not
probability
• Estimating the prior probability for each hypothesizes
is not always possible.
BMA
• All possible models in the model space used
weighted by their probability of being the “Correct”
model
• Optimal given the correct model space and priors
Why are Ensembles Successful?
• Bayesian perspective:
P(Cᵢ | x) = Σ_{all models Mⱼ} P(Cᵢ | x, Mⱼ) P(Mⱼ)
• If the dⱼ are independent:
Var(y) = Var( (1/L) Σⱼ dⱼ ) = (1/L²) Var( Σⱼ dⱼ ) = (1/L²) · L · Var(dⱼ) = (1/L) Var(dⱼ)
Sudeshna Sarkar
IIT Kharagpur
Bagging
• Bagging = “bootstrap aggregation”
– Draw N items from X with replacement
• Desired learners with high variance (unstable)
– Decision trees and ANNs are unstable
– K-NN is stable
• Use bootstrapping to generate L training sets and
train one base-learner with each (Breiman, 1996)
• Use voting
Bagging
• Sampling with replacement
[Figure: bagging — bootstrap samples B1, B2, B3 drawn with replacement from the training set.]
Illustrating AdaBoost
[Figure: three boosting rounds on a one-dimensional dataset. Each round reweights the examples (sample weights such as 0.0094, 0.4623 in round 1, 0.3037, 0.0009, 0.0422 in round 2, 0.0276, 0.1819, 0.0038 in round 3) and fits a weak classifier with coefficient α = 1.9459, 2.9323 and 3.8744 respectively; the overall classifier combines the three weighted votes.]
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part A: Introduction and kmeans
Sudeshna Sarkar
IIT Kharagpur
Unsupervised learning
• Unsupervised learning:
– Data with no target attribute. Describe hidden structure from
unlabeled data.
– Explore the data to find some intrinsic structures in them.
• Clustering: the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more
similar to each other than to those in other clusters.
• Useful for
– Automatically organizing data.
– Understanding hidden structure in data.
– Preprocessing for further analysis.
Applications: News Clustering (Google)
Gene Expression Clustering
Other Applications
• Biology: classification of plants and animal kingdom
given their features
• Marketing: Customer Segmentation based on a
database of customer data containing their
properties and past buying records
• Clustering weblog data to discover groups of similar
access patterns.
• Recognize communities in social networks.
An illustration
• This data set has four natural clusters.
Aspects of clustering
• A clustering algorithm, such as
– Partitional clustering, e.g., k-means
– Hierarchical clustering, e.g., AHC
– Mixture of Gaussians
• A distance or similarity function
– such as Euclidean, Minkowski, cosine
• Clustering quality
– Inter-cluster distance maximized
– Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of
objects using some criterion
• Model-based: Hypothesize a model for each cluster and find
best fit of models to data
• Density-based: Guided by connectivity and density functions
• Graph-Theoretic Clustering
Partitioning Algorithms
• Partitioning method: Construct a partition of a
database D of m objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic method: k-means (MacQueen, 1967)
Hierarchical Clustering
[Figure: a taxonomy tree — animal splits into vertebrate and invertebrate, and so on.]
Quality of Clustering
• Internal evaluation:
– assign the best score to the algorithm that produces clusters with high
similarity within a cluster and low similarity between clusters, e.g.,
Davies-Bouldin index
• External evaluation:
– evaluated based on data such as known class labels and external
benchmarks, eg, Rand Index, Jaccard Index, f-measure
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part C: Hierarchical Clustering
Sudeshna Sarkar
IIT Kharagpur
Hierarchical Clustering
[Figure: a taxonomy tree — animal splits into vertebrate and invertebrate.]
Dendrogram: Hierarchical Clustering
[Figure: a dendrogram over points p1–p5 (merge heights around 0.05, 0.15, 0.2) built from an input set S, together with the corresponding distance/proximity matrix over p1, p2, …, p12.]
Intermediate State
• After some merging steps, we have some clusters
[Figure: the current clusters C1–C5 shown in the data space, with the distance/proximity matrix over C1, …, C5.]
Intermediate State
Merge the two closest clusters (C2 and C5) and update the
distance matrix.
[Figure: the clusters and the distance/proximity matrix, with the closest pair C2 and C5 highlighted.]
After Merging
• Update the distance matrix
[Figure: the updated distance/proximity matrix — C2 ∪ C5 replaces C2 and C5, and its distances to C1, C3, C4 (marked “?”) must be recomputed.]
Closest Pair
• A few ways to measure distances of two clusters.
• Single-link
– Similarity of the most similar (single-link)
• Complete-link
– Similarity of the least similar points
• Centroid
– Clusters whose centroids (centers of gravity) are the
most similar
• Average-link
– Average cosine between pairs of elements
Distance between two clusters
• Single-link distance between clusters Ci and Cj
is the minimum distance between any object
in Ci and any object in Cj
Single-link clustering: example
• Determined by one pair of points, i.e., by one
link in the proximity graph.
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
[Figure: the resulting single-link dendrogram over items 1–5.]
Complete link method
Complete-link clustering: example
• Distance between clusters is determined by
the two most distant points in the different
clusters
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
[Figure: the resulting complete-link dendrogram over items 1–5.]
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to
compute similarity of all pairs of N initial instances,
which is O(N2).
• In each of the subsequent N − 2 merging iterations,
compute the distance between the most recently
created cluster and all other existing clusters.
• In order to maintain an overall O(N2) performance,
computing similarity to each other cluster must be
done in constant time.
– Often O(N³) if done naively, or O(N² log N) if done more
cleverly
Average Link Clustering
• Similarity of two clusters = average similarity between
any object in Ci and any object in Cj
sim(Cᵢ, Cⱼ) = (1 / (|Cᵢ|·|Cⱼ|)) Σ_{x∈Cᵢ} Σ_{y∈Cⱼ} sim(x, y)
• Compromise between single and complete link. Less
susceptible to noise and outliers.
• Two options:
– Averaged across all ordered pairs in the merged
cluster
– Averaged over all pairs between the two original
clusters
The complexity
• All the algorithms are at least O(n2). n is the
number of data points.
• Single link can be done in O(n2).
• Complete and average links can be done in
O(n2logn).
• Due to the complexity, these methods are hard to use for large data
sets.
Model-based clustering
• Assume data generated from probability
distributions
• Goal: find the distribution parameters
• Algorithm: Expectation Maximization (EM)
• Output: Distribution parameters and a soft
assignment of points to clusters
Model-based clustering
• Assume probability distributions with parameters θ
• Given data X, compute θ such that
Pr(X | θ)   [likelihood], or
log Pr(X | θ)   [log likelihood]
is maximized.
• Every point may be generated by multiple
distributions with some probability
EM Algorithm
• Initialize the parameters randomly
• Let each parameter corresponds to a cluster center (mean)
• Iterate between two steps
– Expectation step: (probabilistically) assign points to
clusters
– Maximization step: re-estimate the distribution parameters
(e.g., the cluster means) from the soft assignments
Module 9: Clustering
Part B: kmeans clustering
Sudeshna Sarkar
IIT Kharagpur
Partitioning Algorithms
• Given k
• Construct a partition of m objects {x₁, …, x_m},
where each xᵢ is a vector in a real-valued space X ⊆ Rⁿ (n is the number of attributes),
• into a set of k clusters C₁, …, C_k.
• The cluster mean μⱼ serves as a prototype of cluster Cⱼ.
• Find k clusters that optimize a chosen criterion
– e.g., the within-cluster sum of squares (WCSS), the sum of squared
distances of each point in a cluster to the cluster mean:
WCSS = Σⱼ Σ_{xᵢ∈Cⱼ} ‖xᵢ − μⱼ‖²
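A minimal NumPy sketch of k-means minimizing this WCSS; the two-blob data, initialization and iteration cap are illustrative assumptions:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate assignment and centroid updates to reduce the WCSS."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize with k random points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # convergence: centroids stop moving
            break
        centers = new_centers
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    wcss = float(((X - centers[labels]) ** 2).sum())
    return centers, labels, wcss

# Two well-separated blobs; k-means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels, wcss = kmeans(X, k=2)
print(centers.round(1), round(wcss, 2))   # centers near (0, 0) and (5, 5)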
Stopping/convergence criterion
1. no re-assignments of data points to different
clusters, OR
2. no (or minimum) change of centroids, OR
3. minimum decrease in the sum of squared error
Kmeans illustrated
Similarity / Distance measures
• Distance metric (scale-dependent)
– Minkowski family of distance measures
• Pearson correlation
Convergence of K-Means
• Recomputation monotonically decreases each squared error, since
Σ_{xᵢ∈Cⱼ} (xᵢ − a)² reaches its minimum for a = (1/mⱼ) Σ_{xᵢ∈Cⱼ} xᵢ
(mⱼ is the number of members in cluster j), i.e., for the centroid.
Time Complexity
• Computing distance between two items is
O(n) where n is the dimensionality of the
vectors.
• Reassigning clusters: O(km) distance
computations, or O(kmn).
• Computing centroids: Each item gets added
once to some centroid: O(mn).
• Assume these two steps are each done once
for t iterations: O(tknm).
Advantages
• Fast, robust, and easy to understand.
• Relatively efficient: O(tkmn)
• Gives its best results when the clusters are distinct or
well separated from each other.
Disadvantages
• Requires apriori specification of the number of
cluster centers.
• Hard assignment of data points to clusters
• Euclidean distance measures can unequally
weight underlying factors.
• Applicable only when mean is defined i.e. fails
for categorical data.
• Only local optima
K-Means on RGB image
[Figure: each pixel xᵢ = {rᵢ, gᵢ, bᵢ} is fed to the k-means classifier, which outputs its cluster assignment C(xᵢ); the cluster parameters are the centroids μ₁ for C₁, μ₂ for C₂, …, μ_k for C_k.]