
EECS708: Machine Learning

Classification – part 2
Dr. Ioannis Patras
School of EECS

Slide credit: Dr. Tim Hospedales


Course Context
• Supervised Learning
– (Linear) regression
– (Linear) Classifiers and Logistic Regression
– Neural Networks
• Unsupervised
– Clustering
– Density Estimation
– Dimensionality reduction (partial)
• Advanced topics
– Deep Learning, Convolutional Neural Networks
– Ensemble Learning
Classification: Overview

• Decision Trees
• Naïve Bayes
• Practical Issues and performance metrics
Decision Trees: Play Tennis Dataset

[Table: Play Tennis training examples with attributes Outlook, Temperature, Humidity, Wind and label Play]
Decision Trees: Contingency Tables
• For every combination of attributes, record how frequently it occurs
• Check the cube to predict new data
– Would be slow
– A decision tree can compress the cube

[Figure: 3-D contingency cube over Outlook (Sunny / Overcast / Rainy), Humidity (Humid / Not Humid) and class (Play / No Play)]
Decision Trees: Model Structure & Test-Time Procedure
• Internal nodes:
– Test the value of a particular attribute: Equality /
Inequality.
– Branch according to the result
• Leaf nodes:
– Specify the class f(x)
• Test time:
Classify x* by sending it down the tree
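
To make this concrete, here is a minimal Python sketch of the test-time procedure (the Leaf/Node classes and the boolean-attribute convention are illustrative assumptions, not code from the module):

class Leaf:
    def __init__(self, label):
        self.label = label            # class f(x) assigned at this leaf

class Node:
    def __init__(self, attr, left, right):
        self.attr = attr              # attribute tested at this internal node
        self.left = left              # subtree followed when x[attr] == 0
        self.right = right            # subtree followed when x[attr] == 1

def classify(tree, x):
    """Send x down the tree until it reaches a leaf."""
    while isinstance(tree, Node):
        tree = tree.right if x[tree.attr] else tree.left
    return tree.label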
Decision Tree: How to Grow/Train
• What algorithm do you think can construct a
tree from data?
– Hint: It’s recursive.
• Suppose you have a magic pick-best attribute
function?
Decision Tree: How to Grow/Train

Recursive Algorithm:
• Grow(T)
– if All y=0, return Leaf(0)
– elseif All y=1, return Leaf(1)
– else
• xj = ChooseBestAttribute(T)
• T0 = <x,y> in T with xj=0
• T1 = <x,y> in T with xj=1
• Return Node(xj, Grow(T0), Grow(T1))
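
A runnable Python version of this recursion might look as follows (a sketch reusing the Leaf/Node classes above; choose_best_attribute is defined after the information-gain slide, and guards for empty splits are omitted for brevity):

def grow(T):
    """T is a list of (x, y) pairs: boolean feature dicts x and labels y in {0, 1}."""
    labels = [y for _, y in T]
    if all(y == 0 for y in labels):
        return Leaf(0)
    if all(y == 1 for y in labels):
        return Leaf(1)
    j = choose_best_attribute(T)                  # the "magic" picker
    T0 = [(x, y) for x, y in T if x[j] == 0]      # left branch data
    T1 = [(x, y) for x, y in T if x[j] == 1]      # right branch data
    return Node(j, grow(T0), grow(T1))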
Grow/Train: How to Choose the Best Attribute?
• Pick the attribute that greedily maximizes accuracy?
– In this example, x1
• j = ChooseBestAttribute(T)
– Choose j to minimize:
#examples <x,y> in T0 with y≠0  +  #examples <x,y> in T1 with y≠1
• (In practice, maximize information gain instead)
Entropy and Information Gain
Entropy of a random variable Y:

H(Y) = −Σ_y p(y) log p(y)

A measure of how uniform/peaked a distribution is: near-uniform distributions have high entropy, sharply peaked distributions have low entropy.


Information Gain in Decision Trees
Entropy of a random variable Y, before the split:

H(Y) = −Σ_y p(y) log p(y)

Y_l : distribution at the left branch
Y_r : distribution at the right branch

Information gain of Y given a split A = {l, r}:

IG(Y|A) = H(Y) − Σ_a p_a H(Y_a)

where p_a is the proportion of the data in branch a ∈ {l, r}.
Training with non-boolean features

• Nominal
– Test one value versus all the others
(Outlook=Sunny)
– Group into disjoint subsets. (Postcode = W1)
• Continuous
– Threshold inequality xj > th
Decision Tree: What Can They Represent? (Nominal Data)
• Depth-1 tree
– Any Boolean function of 1 feature
• Depth-2 tree
– Any Boolean function of two features

[Figure: example depth-1 and depth-2 trees testing attributes p and q, with false/true branches]
• DT can represent any boolean function
– (But worst case 2^N leaves)
Decision Tree: What Can They Represent? (Continuous Data)
• If Length > L1
– Then Salmon
• Else
– If Lightness > L2, then Cod
– Else, Salmon
• Represent:
– Axis parallel cuts.
– Can approximate but not exactly
represent diagonal boundaries.
– Can become arbitrarily complex
with enough data
Decision Tree: Over-fitting & Regularization

• Suppose one unusual day: [Sunny, Hot, Normal, Strong, Play=No]
– What happens to the tree?
– New (noisy) nodes will be grown under Sunny–Normal–…
Decision Tree: Over-fitting
• Overfitting, formally:
– Train error (known): E(M, Dtrain)
– Future error (unknown): E(M, Dall)
– Model M overfits if there is some other model M’ with:
• E(M, Dtrain) < E(M’, Dtrain)
• E(M, Dall) > E(M’, Dall)
Decision Tree: Regularization
Avoiding Over-fitting for a decision tree. Ideas?
1. Grow full tree then prune
– How to guide pruning?
• Measure performance on train data?
• Measure performance on validation data?
2. Add regularizer to split objective
– xj = ChooseBestAttribute(T)
– If error improvement < λ* #nodes
• Then skip
– (Determine λ by validation)
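
As an illustration, the regularized split test could be folded into the growing procedure like this (a sketch: the λ value, the majority-vote error measure, and the two-node cost are all assumptions of this sketch):

LAMBDA = 0.01   # regularization strength, chosen by validation (assumed value)

def majority_error(labels):
    """Training error of predicting the majority class."""
    return min(sum(labels), len(labels) - sum(labels)) / len(labels)

def should_split(T, j, extra_nodes=2):
    """Skip the split if its error improvement is below lambda * #new nodes."""
    before = majority_error([y for _, y in T])
    after = 0.0
    for value in (0, 1):
        branch = [y for x, y in T if x[j] == value]
        if branch:
            after += len(branch) / len(T) * majority_error(branch)
    return (before - after) >= LAMBDA * extra_nodes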
Decision Tree: Summary
• Decision Tree Classifier:
– Representation: Tree
– Evaluation: Accuracy
– Train: Greedy, recursive
– Test: Traverse tree
– Prevent overfit: regularize on # of nodes, or prune
• Properties:
– Good: mixed-type data (no 1-of-N encoding!)
– Good: high dimensions
– Classification at test time can take < O(d)! (Cf. NN: O(dn), MaxEnt: O(d))
– Frequently used in industry
– May be interpretable
– Finding the optimal tree is NP-complete: practical trees are not optimal, but good enough
– Some pathological problems can’t be represented as trees
Classification: Overview

• Decision Trees
• Naïve Bayes
– Bayes Theorem
– ML fitting
• Practical Issues
Naïve Bayes – Bayes Theorem Recap
• Bayes Theorem
– P(A) “Prior probability of A”
– P(B|A) “Probability of B given A”

p(A|B) = p(B|A) p(A) / p(B)
p(B|A) p(A) = p(A, B)
p(A) = Σ_B p(A, B)
Naïve Bayes – Bayes Theorem Recap
• Bayes Theorem:
– P(H) “Prior probability of hypothesis H”
– P(D|H) “Probability of data D given hypothesis H”

p(H|D) = p(D|H) p(H) / p(D)
p(D|H) p(H) = p(D, H)
p(D) = Σ_H p(D, H)
Naïve Bayes – Bayes Theorem Recap
• Bayes Theorem example:
– Hypothesis = {C, !C}, Data = {+, −}
– P(C) = 0.008, P(!C) = 0.992
– P(+|C) = 0.98, P(−|C) = 0.02
– P(+|!C) = 0.03, P(−|!C) = 0.97

p(C|+) = 0.98 × 0.008 / (0.98 × 0.008 + 0.03 × 0.992)
p(C|+) ≈ 0.2
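
A quick check of this arithmetic in Python:

p_c, p_not_c = 0.008, 0.992            # prior over hypotheses
p_pos_c, p_pos_not_c = 0.98, 0.03      # likelihood of a positive test

posterior = p_pos_c * p_c / (p_pos_c * p_c + p_pos_not_c * p_not_c)
print(round(posterior, 3))             # 0.209: C is still unlikely after a positive test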
Bayes Theorem - Visualization
• Set interpretation
– P(A) is size of set A in the world
– P(A,B) is the size of the intersection of set A&B
– P(A|B) is the fraction of the space where B is
true that A is also true
H = “Have a headache”
F = “Coming down with Flu”

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

“Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50/50 chance you’ll have a headache.”
Bayes Theorem - Visualization
• Set interpretation
– P(A) is size of set A in the world
– P(A,B) is the size of the intersection of set A&B
– P(A|B) is the fraction of the space where B is
true that A is also true
H = “Have a headache”
F = “Coming down with Flu”

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

Good reasoning?! One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu.”
Wrong: think of the fraction of H occupied by F, not the fraction of F occupied by H. By Bayes, P(F|H) = P(H|F) P(F) / P(H) = (1/2 × 1/40)/(1/10) = 1/8.
From Bayes Theorem to Naïve Bayes
p(H|D) = p(D|H) p(H) / p(D)

• Bayes Theorem:
– P(H) “Prior probability of hypothesis H”
– P(D|H) “Probability of data D given hypothesis H”
• What if we have two or more sources of data?
– Recall: p(A, B) = p(A) p(B) iff independent
• Then either
p(H|D1, D2) = p(D1, D2|H) p(H) / p(D1, D2)
or
p(H|D1, D2) = p(D1|H) p(D2|H) p(H) / p(D1, D2)
• Using the latter regardless is known as the “Naïve” assumption
From Bayes Theorem to Naïve Bayes

• A direct Bayesian classifier would have to model:
– P(“Viagra”=1, “Cheap”=1, …|Spam) = …
– P(“Viagra”=0, “Cheap”=1, …|Spam) = …
– P(“Viagra”=1, “Cheap”=0, …|Spam) = …
– P(“Viagra”=0, “Cheap”=0, …|Spam) = …
– …
– P(“Viagra”=1, “Cheap”=1, …|Ham) = …
– P(“Viagra”=0, “Cheap”=1, …|Ham) = …
– P(“Viagra”=1, “Cheap”=0, …|Ham) = …
– P(“Viagra”=0, “Cheap”=0, …|Ham) = …

p(H|D1, D2) = p(D1, D2|H) p(H) / p(D1, D2)

• Clearly the table size, and hence the data requirement, is exponential in the size of the dictionary. Cf. 2^80,000!
Naïve Bayes Classifier

• Naïve Bayes spam classification:

p(H|D_1..N) = Π_i p(D_i|H) p(H) / Π_i p(D_i)

• P(“Viagra”|Spam) = 90%
• P(“Viagra”|Ham) = 5%
• P(“Cheap”|Spam) = 60%
• P(“Cheap”|Ham) = 30%
• P(Spam) = 10%
• P(Ham) = 90%

• P(Spam|Cheap) = p(C|S) p(S)/Z = 0.6×0.1/(0.6×0.1 + 0.3×0.9) = 18%
• P(Spam|Viagra) = p(V|S) p(S)/Z = 0.9×0.1/(0.9×0.1 + 0.05×0.9) = 67%
• P(Spam|Cheap, Viagra) = p(V|S) p(C|S) p(S)/Z = 0.6×0.9×0.1/(0.6×0.9×0.1 + 0.3×0.05×0.9) = 80%
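
These posteriors are easy to reproduce in Python (a sketch; here the normalizer Z sums the class-conditional joints over both classes, which yields the same numbers):

# Likelihoods p(word|class) and priors, taken from the slide.
p_word = {"Spam": {"Viagra": 0.90, "Cheap": 0.60},
          "Ham":  {"Viagra": 0.05, "Cheap": 0.30}}
prior = {"Spam": 0.10, "Ham": 0.90}

def p_spam_given(words):
    """Naive Bayes posterior p(Spam | observed words)."""
    joint = dict(prior)
    for c in joint:
        for w in words:
            joint[c] *= p_word[c][w]           # naive independence assumption
    return joint["Spam"] / (joint["Spam"] + joint["Ham"])

print(p_spam_given(["Cheap"]))                 # ~0.18
print(p_spam_given(["Viagra"]))                # ~0.67
print(p_spam_given(["Cheap", "Viagra"]))       # 0.80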
Naïve Bayes Classifier: Continuous Data
• For continuous data, often model p(D|H) as a Gaussian:

p(x|μ, σ) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))

• P(S|x*_col, x*_len) = p(x*_col|S) p(x*_len|S) p(S)/K
• P(C|x*_col, x*_len) = p(x*_col|C) p(x*_len|C) p(C)/K

[Figure: class-conditional Gaussians p(Len|Salmon), p(Len|Cod), p(Color|Salmon), p(Color|Cod), with a query point x*]
Learning Naïve Bayes Classifier: Discrete
• To learn the NB classifier, we need to fit probability distributions.
• Observe a coin with H,H,H,T,T:
– p(Heads|Coin) = 3/5 = 60%, p(Tails|Coin) = 2/5 = 40%
• Roll a die 60 times, observing 12×1, 8×2, 11×3, 9×4, 14×5, 6×6:
– P(1|Die) = 20%, …, P(6|Die) = 10%
• This is called a binomial/multinomial distribution.
• The parameter tells you the bias: [0.6, 0.4] and [0.2, 0.13, 0.18, 0.15, 0.23, 0.1]
• Find the parameter that maximizes the probability of the data:
– W_coin = argmax p(D|W_coin)
– w_j = N_j / Σ_j N_j (where N_j counts the number of outcomes of type j)
Learning Naïve Bayes Classifier: Discrete: Math and Pseudocode
• Find the parameter that maximizes the probability of the data:
– W = argmax p(D|W)

w_jk = N_jk / Σ_j N_jk
N_jk = Σ_i I(x_ik = j)

• Pseudocode:
– Foreach attribute k
• Foreach data item i
– Foreach state j
• N(j,k) += 1 if x_ik = j
• Make N(:,k) sum to 1
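
The same fit as a minimal Python sketch (assuming each data item is a tuple of discrete attribute values; in NB you would run one such fit per class):

from collections import Counter

def fit_discrete(X):
    """MLE w_jk = N_jk / sum_j N_jk for each attribute k; X is a list of tuples."""
    n = len(X)
    params = []
    for k in range(len(X[0])):                 # foreach attribute k
        counts = Counter(x[k] for x in X)      # N_jk: count each state j
        params.append({j: c / n for j, c in counts.items()})  # normalize
    return params

# Example: the die from the previous slide, as a single-attribute dataset.
rolls = [(v,) for v, times in [(1, 12), (2, 8), (3, 11), (4, 9), (5, 14), (6, 6)]
         for _ in range(times)]
print(fit_discrete(rolls))   # [{1: 0.2, 2: 0.133, 3: 0.183, 4: 0.15, 5: 0.233, 6: 0.1}]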
Learning Naïve Bayes Classifier: Continuous
• To learn the NB classifier, independently find the parameter that maximizes the probability of the training data.
• For a Gaussian: {μ, σ} = argmax p(D|μ, σ)

p(x|μ, σ) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
μ = (1/N) Σ_i x_i
σ² = (1/(N−1)) Σ_i (x_i − μ)²

• D = {<x_l, x_c, fish>} = {<0.1, 0.3, cod>, <0.2, 0.4, cod>, …, <0.3, 0.2, salm>, <0.4, 0.3, salm>}
• Then:
– Cod: μ_len = (0.1 + 0.2 + …)/N
– Salmon: μ_len = (0.3 + 0.4 + …)/N, etc.

[Figure: fitted class-conditional Gaussians p(Len|Salmon), p(Len|Cod), p(Color|Salmon), p(Color|Cod), with a query point x*]
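
A sketch of the Gaussian fit and the resulting class score in Python (illustrative variable names; uses the N−1 normalized variance as on the slide):

import math

def fit_gaussian(xs):
    """Maximum-likelihood mean and (N-1 normalized) variance of a 1-D sample."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Fit p(len|class) and p(color|class) per class, then score a query x*:
# score(class) = gaussian_pdf(x_len, *len_params[class]) \
#              * gaussian_pdf(x_col, *col_params[class]) * prior[class]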
Naïve Bayes Classifier: Over-fitting
• What if you had ten spams and no real emails with “Viagra”?
– Our parameter estimate equation: p(x_j) = N_j / Σ_j N_j
– P(Viagra|Spam) = 10/(10+0) = 100%
– P(Viagra|Ham) = 0/(10+0) = 0%
• Now you get a long email from a friend that happens to mention Viagra:
– The spam evidence from one “Viagra” overrides every other indication of ham from the email (multiply by zero).
– How to fix?

p(H|D_1..N) = Π_i p(D_i|H) p(H) / Π_i p(D_i)
Naïve Bayes Classifier: Regularization
• What if you had ten spams and no real emails with “Viagra”?
– MLE learning: w_j = N_j / Σ_j N_j
• P(Viagra|Spam) = 10/(10+0) = 100%
• P(Viagra|Ham) = 0/(10+0) = 0%
– Regularized learning, λ = 1: w_j = (N_j + λ) / Σ_j (N_j + λ)
• P(Viagra|Spam) = (10+1)/(11+1) = 92%
• P(Viagra|Ham) = 1/(11+1) = 8%
• Now, with enough positive evidence, an email could be Ham despite including Viagra.
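
In code, the fix is a one-line change to the estimator; a sketch of this additive smoothing (often called Laplace smoothing):

def fit_smoothed(counts, lam=1.0):
    """Regularized estimate w_j = (N_j + lambda) / sum_j (N_j + lambda)."""
    total = sum(n + lam for n in counts.values())
    return {j: (n + lam) / total for j, n in counts.items()}

print(fit_smoothed({"viagra": 10, "no_viagra": 0}))
# {'viagra': 0.9166..., 'no_viagra': 0.0833...} -- no more hard zeros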
NB Classifier: What can it classify?
• For continuous data:
– Naïve Bayes models a line (or quadratic curve)
• For discrete data:
– Naïve Bayes models a line (just like MaxEnt!)
• => A simpler boundary than a DT

[Figure: class-conditional Gaussians p(Len|Salmon), p(Len|Cod), p(Color|Salmon), p(Color|Cod), with a query point x*]
Naïve Bayes Issues: Overconfidence
• Naïve assumption:
– Counts each piece of evidence equally
– Does not exploit attribute correlation

p(D1, D2|H) = p(D1|H) p(D2|H)
p(D_1..N|H) = Π_i p(D_i|H)

• You could attack a spam filter by listing all the fish species below your Viagra ad…
Naïve Bayes Classifier: Relation to MaxEnt
• Both classifiers have simple boundaries
• For data D = {y_i, x_i}:
– MaxEnt: w* = argmax E_MCL(w, D) = Π_i p(y_i|x_i, w)
– Naïve Bayes: w* = argmax E_ML(w, D) = Π_i p(x_i|y_i, w)

p(H|D) = p(D|H) p(H) / p(D)

• NB learning decouples the prior:
– You can take your NB cancer classifier to Chernobyl and it will still work…
– You can move your NB fish classifier from the UK to Norway…
– Your MaxEnt cancer classifier will have to re-train from scratch
An Aside: Online Learning

• Sometimes you want to learn from a data stream instead of from a pre-existing static database:
1. Because you want to keep your model very up-to-date.
2. Because your database is too huge to fit in memory, and you don’t want to read it off disk more than once.
– This task is known as online learning.
• Any algorithm can be re-trained from scratch every time a new row is added from the stream.
– E.g., for MaxEnt you repeat your O(dn) training for each of n data items.
– Inefficient!! Leads to n·O(dn) = O(dn²)
• An algorithm that can update the model from the stream in O(1) (i.e., without revisiting the old database) has the incremental property.
Naïve Bayes Classifier: Online Learning

• Naïve Bayes is naturally online and incremental!
• If you want to learn from a continuous stream of observations:
– Maintain your sufficient statistics N_j
• (i.e., how many times each token j is associated with the current class)
– Add +1 to the appropriate N_j for each new observation x_j
• There is a corresponding regularization for the continuous version:

w_j|D = (N_j + λ) / Σ_j (N_j + λ)
w_j|D, D' = (N_j + N'_j + λ) / Σ_j (N_j + N'_j + λ)
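
A sketch of the incremental update in Python (token counts as the sufficient statistics; smoothed probabilities are recomputed on demand, and the extra unseen-token bucket in the denominator is an assumption of this sketch):

class OnlineNB:
    """Per-class token counts N_j; each update is O(1), no old data revisited."""
    def __init__(self, lam=1.0):
        self.lam = lam
        self.counts = {}                        # class -> {token: N_j}

    def observe(self, token, cls):
        """Add +1 to the appropriate N_j for a new observation."""
        c = self.counts.setdefault(cls, {})
        c[token] = c.get(token, 0) + 1

    def prob(self, token, cls):
        """Smoothed w_j = (N_j + lambda) / sum_j (N_j + lambda)."""
        c = self.counts.get(cls, {})
        total = sum(c.values()) + self.lam * (len(c) + 1)   # +1 unseen bucket
        return (c.get(token, 0) + self.lam) / total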
Naïve Bayes Classifier: Summary
• Naïve Bayes Classifier:
– Representation: Likelihoods + Bayes theorem
– Evaluation: Likelihood
– Train: exact maximum likelihood (each attribute independently); set the prior manually or by ML
– Test: Maximum A-Posteriori

p(H|D1, D2) = p(D1|H) p(D2|H) p(H) / p(D1, D2)

• Properties:
– Train complexity: O(dn)
– Test complexity: O(d)
– Good in high dimensions, even d > n
– Good for Big Data: incremental, online, one-pass
– Can change priors
– Good for mixed-type data
Case Studies: View Angle Classification (EECS work! ☺)
A decision tree classifies face pixels and predicts the view direction.
Classification: Overview

• Decision Trees
• Naïve Bayes
• Performance Metrics
Performance Metrics
• The right metric to use depends on the application.
– Misuse of a metric can be very misleading, so better to understand them!
• Accuracy
• Confusion Matrix
• Expected Utility
• ROC Curve
Performance Metrics: Accuracy
• If the classifier makes predictions y_est and the true values are y_tru:
• Accuracy: percentage of correct answers

Acc = (1/N) Σ_i I(y_i^est = y_i^tru)

• Advantages:
– Easy, single number
• Limitations:
– Doesn’t account for imbalanced data:
• E.g., loans: 90% of people overall pay back their loan
• A bank classifies good/bad borrowers to make lending decisions
• If it classifies all as good => 90% “accurate” …but useless!
– Doesn’t account for which mistakes are made
– Doesn’t account for classifier calibration
Performance Metrics: Confusion Matrix
– The confusion matrix compares how many instances of each actual category are predicted as each estimated category.
• The sum of the confusion matrix diagonal gives the accuracy.
– (Accuracy = % of correct answers, 7/10 = 70% in this example)

              Actual 1  Actual 0               Actual 1  Actual 0
Predicted 1      TP        FP      Predicted 1     4         1
Predicted 0      FN        TN      Predicted 0     2         3
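
A small Python sketch reproducing this example's counts and accuracy:

def confusion(y_true, y_pred):
    """Binary confusion counts (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]    # 6 positives, 4 negatives
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]    # matches the slide's 4/1/2/3 matrix
tp, fp, fn, tn = confusion(y_true, y_pred)
print((tp + tn) / len(y_true))             # 0.7 -- the diagonal over N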
Performance Metrics: Confusion Matrix
– The confusion matrix compares how many instances of each actual category are predicted as each estimated category.
• Sometimes which mistakes you make matters more than the total number of mistakes.
– E.g., loans: predicting good/bad credit.
• Consider two classifier results:
– Accuracy = 50% in each case
– Both classifiers get the bank 3 loans worth of interest payments
– But which is more useful?
– Classifier A: Lost business: 1, Bad loans: 4
– Classifier B: Lost business: 4, Bad loans: 1

              A: Actual G  Actual B       B: Actual G  Actual B
Predicted G        3           4               3           1
Predicted B        1           2               4           2
Performance Metrics: Confusion Matrix
• Accuracy results can mislead if you have imbalanced data.
– A normalised confusion matrix can reveal this.
– E.g., assume 90% of loans are good, so an “accurate” classifier can simply predict that all loans are good. Overall accuracy = 9/10 = 90%. But it’s useless.
– True Positive Rate (fraction of positives identified as positive)
– True Negative Rate (fraction of negatives identified as negative)
– Old diagonal: (TP+TN)/N = 90%. New diagonal: (TPR+TNR)/2 = 50%.

              Actual 1  Actual 0          Actual G  Actual B          Actual G  Actual B
Predicted 1      TP        FP     Pred G     9         1      Pred G  TPR=1.0   FPR=1.0
Predicted 0      FN        TN     Pred B     0         0      Pred B  FNR=0.0   TNR=0.0
Performance Metrics: Confusion Matrix
• Different applications care about different parts of the confusion matrix.
– E.g., a bank cares more about minimizing FPR (bad loans) than FNR (lost business).
– E.g., a high-security system cares more about minimizing FNR (permitted break-ins) than FPR (false alarms).
• Why? Because each outcome has a different cost.

              Actual 1  Actual 0          Actual G  Actual B          Actual G  Actual B
Predicted 1      TP        FP     Pred G     3         1      Pred G     3/4       1/4
Predicted 0      FN        TN     Pred B     4         2      Pred B     4/6       2/6
Performance Metrics: Expected Value
• Which loan classifier is better, and by how much?
– A makes more good loans, but B makes fewer bad loans.
– An Expected Value calculation gives a single number, given a confusion matrix and a cost matrix:

EV = P(Outcome1)·Val(Outcome1) + P(Outcome2)·Val(Outcome2) + …

Costs:       Actual G  Actual B
Pred G          $2       −$4
Pred B        −$0.1     −$0.1

A:           Actual G  Actual B     EV_A = (2×3 − 4×4 − 0.1×1 − 0.1×2)/10 ≈ −$1 per customer
Pred G           3         4
Pred B           1         2

B:           Actual G  Actual B     EV_B = (2×2 − 4×0 − 0.1×2 − 0.1×6)/10 = $0.32 per customer
Pred G           2         0
Pred B           2         6
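
The same calculation as a short Python sketch:

def expected_value(confusion, cost, n):
    """Average value per customer: sum of count * cost over all cells."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(2) for j in range(2)) / n

cost = [[2.0, -4.0],      # predicted Good: actual Good, actual Bad
        [-0.1, -0.1]]     # predicted Bad:  actual Good, actual Bad

print(expected_value([[3, 4], [1, 2]], cost, 10))   # -1.03 -> classifier A
print(expected_value([[2, 0], [2, 6]], cost, 10))   #  0.32 -> classifier B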
Performance Metrics: ROC
• All of the metrics discussed so far depend on classifier calibration.
– You are correct if your final estimate is y_est = y_tru.
– But good binary classifiers can output a confidence as well as a class…
– By default the classifier says: y_est = 1 if p(y) > 0.5
• If you are worried about FPs, you could say: y_est = 1 if p(y) > 0.75
• If you want to maximize TPs, you could say: y_est = 1 if p(y) > 0.25
• This threshold will change the distribution in the confusion matrix.
– Since the threshold is user/business-context dependent…
• Is there a way to evaluate a classifier independently of the threshold?
– So we can evaluate independently of the end user.

              Predicted 1  Predicted 0
Actual 1          TP           FN
Actual 0          FP           TN
Performance Metrics: ROC
• Consider a variety of thresholds:
– Each threshold defines a TPR and an FPR.

[Figure: score distributions P(Good|x) for bad and good loans with a sliding threshold, and the corresponding (FPR, TPR) point on ROC axes]
Performance Metrics: ROC
• Consider a variety of thresholds:
– Each threshold defines a TPR and an FPR.
– The ROC curve is the graph of TPRs and FPRs.
• (Receiver Operating Characteristic)

[Figure: confusion matrices at two thresholds and the resulting points on the ROC curve]
Performance Metrics: ROC
• Consider a variety of thresholds:
– Each threshold defines a TPR and an FPR.
– The ROC curve is the graph of TPRs and FPRs.
• (Receiver Operating Characteristic)
– Better ROC curves approach the top left.
– The area under the ROC curve is a threshold-independent measure of goodness.
• (AUROC: Perfect: 1, Worst: 0, Random: 0.5)

[Figure: ROC curves, with the perfect classifier hugging the top-left corner]
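
A sketch of how the ROC curve and its area could be computed from classifier scores (assumes a higher score means a more confident positive; ties between scores are not handled specially):

def roc_points(y_true, scores):
    """(FPR, TPR) points swept from the strictest threshold to the loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, y_true), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auroc(points):
    """Trapezoidal area under the ROC curve (1 perfect, 0.5 random)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
print(auroc(pts))   # ~0.83 for this toy scoring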
Summary

• Understand the meaning and uses of different performance metrics:
– Accuracy, confusion matrix, expected value and
ROC curve
Summary: You Should Know

• What is the process for classification using a Decision Tree?
• Sketch an algorithm to learn a Decision Tree
• What is the process for classification using Naïve Bayes?
• Sketch the algorithm to learn Naïve Bayes
• Pros and cons of Decision Trees versus Naïve Bayes
• Understand the meaning and uses of different performance metrics:
– Accuracy, confusion matrix, expected value and ROC
curve
