05-Classification-II-2024
Classification – part 2
Dr. Ioannis Patras
School of EECS
• Decision Trees
• Naïve Bayes
• Practical Issues and performance metrics
Decision Trees: Play tennis dataset
Decision Trees: Contingency Tables
• For every combination of attributes, record how frequently it occurs
• Check the cube to predict new data
  – Would be slow
  – A decision tree can compress the cube
[Figure: contingency cube over Outlook (Sunny / Overcast / Rainy), Humidity (Humid / Not Humid), and class (Play / No Play)]
Decision Trees: Model Structure & Test-Time Procedure
• Internal nodes:
  – Test the value of a particular attribute: equality / inequality
  – Branch according to the result
• Leaf nodes:
  – Specify the class f(x)
• Test time:
  – Classify x* by sending it down the tree
Decision Tree: How to Grow/Train
• What algorithm do you think can construct a tree from data?
  – Hint: it's recursive.
• Suppose you had a magic pick-best-attribute function.
Decision Tree: How to Grow/Train
Recursive algorithm (a runnable Python sketch follows below):
• Grow(T):
  – if all y = 0: return Leaf(0)
  – elseif all y = 1: return Leaf(1)
  – else:
    • xj = ChooseBestAttribute(T)
    • T0 = {<x, y> in T with xj = 0}
    • T1 = {<x, y> in T with xj = 1}
    • return Node(xj, Grow(T0), Grow(T1))
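A minimal Python sketch of the Grow procedure above, assuming binary attributes and binary labels; Leaf, Node, and the choose_best_attribute argument are illustrative names introduced here, not from any library:

```python
# Sketch of Grow(T): T is a list of (x, y) pairs, x a tuple of 0/1 attributes, y in {0, 1}.
# Assumes, as on the slide, that some attribute can always purify a mixed node.
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: int                      # class returned at this leaf

@dataclass
class Node:
    attr: int                       # index j of the attribute tested here
    left: "Tree"                    # subtree for x_j = 0
    right: "Tree"                   # subtree for x_j = 1

Tree = Union[Leaf, Node]

def grow(T, choose_best_attribute) -> Tree:
    if all(y == 0 for _, y in T):
        return Leaf(0)
    if all(y == 1 for _, y in T):
        return Leaf(1)
    j = choose_best_attribute(T)    # the "magic" picker from the previous slide
    T0 = [(x, y) for x, y in T if x[j] == 0]
    T1 = [(x, y) for x, y in T if x[j] == 1]
    return Node(j, grow(T0, choose_best_attribute), grow(T1, choose_best_attribute))

def classify(tree: Tree, x) -> int:
    """Test time: send x down the tree until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.right if x[tree.attr] == 1 else tree.left
    return tree.label
```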
Grow/Train: How to Choose the Best Attribute?
• Pick the attribute that greedily maximizes accuracy?
  – In this example, x1
• j = ChooseBestAttribute(T)
  – Choose j to minimize:
    #examples <x, y> in T0 with y ≠ 0  +  #examples <x, y> in T1 with y ≠ 1
• (In practice, maximize information gain instead; see the next slide)
Entropy and Information Gain
• Entropy of a random variable Y:
  H(Y) = − Σ_y p(y) log p(y)
  – High entropy: labels are mixed and unpredictable; low entropy: labels are nearly pure
• A split sends the data to two branches: Y_L is the label distribution at the left branch, Y_R at the right branch
• Information gain of the split (the quantity to maximize):
  IG = H(Y) − [ p_L H(Y_L) + p_R H(Y_R) ],  where p_L, p_R are the fractions of examples reaching each branch
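Following these formulas, a short sketch of entropy and the information gain of a binary split; the final function is one possible realisation of ChooseBestAttribute from the earlier slide (names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from label counts."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(T, j):
    """IG = H(Y) - [p_L H(Y_L) + p_R H(Y_R)] for a split on binary attribute j."""
    y_left = [y for x, y in T if x[j] == 0]
    y_right = [y for x, y in T if x[j] == 1]
    n = len(y_left) + len(y_right)
    return (entropy(y_left + y_right)
            - (len(y_left) / n) * entropy(y_left)
            - (len(y_right) / n) * entropy(y_right))

def choose_best_attribute(T):
    d = len(T[0][0])                 # number of attributes
    return max(range(d), key=lambda j: information_gain(T, j))
```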
Decision Trees: Types of Attribute Tests
• Nominal
  – Test one value versus all the others (Outlook = Sunny)
  – Group into disjoint subsets (Postcode = W1)
• Continuous
  – Threshold inequality: xj > th
Decision Tree: What can they represent? (Nominal Data)
• Depth-1 tree
[Figure: a depth-1 decision tree and the decision regions it induces]
Decision Tree: Regularization
Avoiding over-fitting for a decision tree. Ideas?
1. Grow the full tree, then prune
   – How to guide pruning?
     • Measure performance on train data? (No: train accuracy always favours the bigger tree)
     • Measure performance on validation data?
2. Add a regularizer to the split objective (see the sketch below)
   – xj = ChooseBestAttribute(T)
   – If error improvement < λ · #nodes added, then skip the split
   – (Determine λ by validation)
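As one concrete realisation of idea 2, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter, which plays the role of λ here; a sketch that picks λ on a validation split, as the slide suggests:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_pruned_tree(X, y, alphas=(0.0, 1e-3, 1e-2, 1e-1)):
    # Hold out a validation set; train accuracy would always pick the biggest tree.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
    best, best_acc = None, -1.0
    for a in alphas:                                   # candidate lambda values
        tree = DecisionTreeClassifier(ccp_alpha=a).fit(X_tr, y_tr)
        acc = tree.score(X_val, y_val)                 # measure on validation data
        if acc > best_acc:
            best, best_acc = tree, acc
    return best
```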
Decision Tree: Summary
• Decision Tree Classifier properties:
  – Representation: tree (mixed-type data, no 1-of-N encoding!)
  – Evaluation: accuracy
  – Train: greedy, recursive
  – Test: traverse the tree
  – Prevent overfit: regularize on # of nodes, or prune
• Good:
  – Handles mixed-type data and high dimensions
  – Classification at test time can take < O(d)! (cf. NN: O(dn), MaxEnt: O(d))
  – Frequently used in industry
  – May be interpretable
• Caveats:
  – Finding the optimal tree is NP-complete; practical trees are not optimal, but good enough
  – Some pathological problems can't be represented as trees
Classification: Overview
• Decision Trees
• Naïve Bayes
– Bayes Theorem
– Maximum-likelihood (ML) fitting
• Practical Issues
Naïve Bayes – Bayes Theorem Recap
• Bayes Theorem:
  p(A | B) = p(B | A) p(A) / p(B)
  – P(A): "prior probability of A"
  – P(B | A): "probability of B given A"
• Useful identities:
  – p(B | A) p(A) = p(A, B)
  – p(A) = Σ_B p(A, B)
Naïve Bayes – Bayes Theorem Recap
• Bayes Theorem:
  p(H | D) = p(D | H) p(H) / p(D)
  – P(H): "prior probability of hypothesis H"
  – P(D | H): "probability of data D given hypothesis H"
• Worked example: given P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2:
  P(F | H) = P(H | F) P(F) / P(H) = (1/2 × 1/40) / (1/10) = 1/8
• Spam example: P(Spam | Cheap, Viagra) = p(Viagra | Spam) p(Cheap | Spam) p(Spam) / Z
  = (0.6 × 0.9 × 0.1) / (0.6 × 0.9 × 0.1 + 0.3 × 0.05 × 0.9) = 80%
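The same arithmetic in code, using only the numbers given above:

```python
# p(word | class) likelihoods and class priors from the slide.
p_spam, p_ham = 0.1, 0.9
p_viagra = {"spam": 0.6, "ham": 0.3}
p_cheap = {"spam": 0.9, "ham": 0.05}

num = p_viagra["spam"] * p_cheap["spam"] * p_spam
Z = num + p_viagra["ham"] * p_cheap["ham"] * p_ham   # normalizer over both hypotheses
print(num / Z)                                        # 0.8 -> 80% spam
```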
Naïve Bayes Classifier: Continuous Data
• Model each class-conditional density with a Gaussian:
  p(x | μ, σ) = 1/(σ√(2π)) exp( −(x − μ)² / (2σ²) )
• P(C | x*_col, x*_len) = p(x*_col | C) p(x*_len | C) p(C) / K
[Figure: classifying a new point x* from its length (x_len) and colour (x_col)]
Learning Naïve Bayes Classifier: Discrete
• Count how often each state j of attribute k occurs, then normalize (a sketch follows below):
  N_jk = Σ_i I(x_ik = j)
  w_jk = N_jk / Σ_j N_jk
• As pseudocode:
  – Foreach data point i, foreach attribute k: N(j, k) += 1 if x_ik = j
  – Make each column N(:, k) sum to 1
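A sketch of the count-and-normalize procedure, assuming the data for one class is an integer array X with one row per example (in a full classifier you would run this once per class):

```python
import numpy as np

def fit_discrete_nb(X, n_states):
    """w[j, k] = N[j, k] / sum_j N[j, k], where N[j, k] counts state j of attribute k."""
    n, d = X.shape
    N = np.zeros((n_states, d))
    for i in range(n):               # foreach data point i
        for k in range(d):           # foreach attribute k
            N[X[i, k], k] += 1       # N(j, k) += 1 if x_ik = j
    return N / N.sum(axis=0)         # make each column N(:, k) sum to 1
```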
Learning Naïve Bayes Classifier: Continuous
• To learn the NB classifier, independently find the parameters that maximize the probability of the training data:
  {μ, σ} = argmax p(D | μ, σ)
• For a Gaussian:
  p(x | μ, σ) = 1/(σ√(2π)) exp( −(x − μ)² / (2σ²) )
• Maximum-likelihood estimates:
  μ = (1/N) Σ_i x_i
  σ² = (1/(N−1)) Σ_i (x_i − μ)²
• D = {<x_len, x_col, fish>} = {<0.1, 0.3, cod>, <0.2, 0.4, cod>, …, <0.3, 0.2, salm>, <0.4, 0.3, salm>}
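A sketch of the maximum-likelihood fit on the toy fish data above (attribute columns: length, colour):

```python
import numpy as np

D = [(0.1, 0.3, "cod"), (0.2, 0.4, "cod"),
     (0.3, 0.2, "salm"), (0.4, 0.3, "salm")]

def fit_gaussian_nb(D):
    """Per class, per attribute: mu = (1/N) sum x_i; sigma^2 = (1/(N-1)) sum (x_i - mu)^2."""
    params = {}
    for c in {label for *_, label in D}:
        X = np.array([row[:2] for row in D if row[2] == c])
        params[c] = (X.mean(axis=0), X.std(axis=0, ddof=1))   # ddof=1 gives the N-1 form
    return params

print(fit_gaussian_nb(D))   # {'cod': (mu, sigma), 'salm': (mu, sigma)}
```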
Learning Naïve Bayes: Regularization
• Unregularized counts can hit zero:
  P(Viagra | Ham) = 0/10 = 0%, so one unseen word vetoes the Ham hypothesis outright
• Regularized learning (Laplace smoothing), λ = 1:
  w_j = (N_j + λ) / Σ_j (N_j + λ)
  P(Viagra | Spam) = (10 + 1)/(11 + 1) ≈ 92%
  P(Viagra | Ham) = (0 + 1)/(11 + 1) ≈ 8%
• Now, with enough positive evidence, an email could be Ham despite including Viagra:
  p(H | D_1..N) = [ Π_i p(D_i | H) ] p(H) / Π_i p(D_i)
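The smoothing formula in code, assuming the counts behind the slide's numbers are 10 of 10 spam emails containing Viagra and 0 of 10 ham emails:

```python
def smoothed(counts, lam=1.0):
    """w_j = (N_j + lambda) / sum_j (N_j + lambda)  (Laplace smoothing)."""
    total = sum(n + lam for n in counts)
    return [(n + lam) / total for n in counts]

# counts = [#emails with Viagra, #emails without], per class:
print(smoothed([10, 0]))   # spam: [0.917, 0.083] -> P(Viagra|Spam) ~ 92%
print(smoothed([0, 10]))   # ham:  [0.083, 0.917] -> P(Viagra|Ham)  ~ 8%
```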
NB Classifier: What can it classify?
• For continuous data:
  – Naïve Bayes models a line (or quadratic curve) as the decision boundary
• For discrete data:
  – Naïve Bayes models a line (just like MaxEnt!)
• ⇒ Simpler boundary than a decision tree
[Figure: class-conditional densities p(Len | Salmon), p(Len | Cod), p(Color | Salmon), p(Color | Cod) and a query point x* on the length axis]
Naïve Bayes Issues: Overconfidence
• Naïve assumption:
  p(D_1, D_2 | H) = p(D_1 | H) p(D_2 | H),  and in general  p(D_1..N | H) = Π_i p(D_i | H)
  – Counts each piece of evidence equally
  – Does not exploit attribute correlation
• Posterior:
  p(H | D_1, D_2) = p(D_1 | H) p(D_2 | H) p(H) / p(D_1, D_2)
• (On the plus side: priors are easy to change, and NB is good for mixed-type data)
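A tiny illustration of the overconfidence problem: feeding the same (perfectly correlated) piece of evidence in twice makes the naive product more extreme, even though no new information arrived. The numbers are made up for the demonstration:

```python
def nb_posterior(lik_h, lik_not_h, prior_h=0.5):
    """p(H | D_1..N) under the naive independence assumption."""
    num, den = prior_h, 1.0 - prior_h
    for lh, ln in zip(lik_h, lik_not_h):
        num *= lh                     # multiply in p(D_i | H)
        den *= ln                     # multiply in p(D_i | not H)
    return num / (num + den)

print(nb_posterior([0.8], [0.4]))            # one observation: 0.667
print(nb_posterior([0.8, 0.8], [0.4, 0.4]))  # duplicated observation: 0.800 (overconfident)
```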
Case Studies: View Angle Classification (EECS work! ☺)
• A decision tree classifies face pixels and predicts view direction.
Classification: Overview
• Decision Trees
• Naïve Bayes
• Performance Metrics
Performance Metrics
• Limitations of plain accuracy:
  – Doesn't account for imbalanced data:
    • E.g., loans: 90% of people overall pay back their loan
    • A bank classifies good/bad borrowers to make lending decisions
    • If it classifies everyone as good, it is 90% "accurate"... but useless!
  – Doesn't account for which mistakes are made
  – Doesn't account for classifier calibration
Performance Metrics: Confusion Matrix

                 Actual 1   Actual 0
  Predicted 1       TP         FP
  Predicted 0       FN         TN

Example (counts):

                 Actual 1   Actual 0
  Predicted 1       4          1
  Predicted 0       2          3
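A sketch that builds the slide's confusion matrix (rows: predicted 1 then 0; columns: actual 1 then 0) from label lists; the example labels are chosen to reproduce the counts above:

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    M = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[1 - p, 1 - t] += 1    # row 0 / column 0 correspond to predicted / actual 1
    return M

y_true = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(confusion_matrix(y_true, y_pred))   # [[4 1]  (TP FP)
                                          #  [2 3]] (FN TN)
```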
Performance Metrics: Confusion Matrix
• The confusion matrix records how many instances of each actual category are predicted as each estimated category.
• Sometimes which mistakes you make matters more than the total number of mistakes
  – E.g., loans: predicting good/bad credit
• Consider two classifier results:
  – Accuracy = 50% in each case
  – Both classifiers get the bank 3 loans' worth of interest payments
  – But which is more useful?
  – Classifier A: lost business: 1, bad loans: 4 (Classifier B: lost business: 4, bad loans: 1)

                 Actual G   Actual B                    Actual G   Actual B
  Predicted G       3          4          Predicted G      3          1
  Predicted B       1          2          Predicted B      4          2

  (left: Classifier A, right: Classifier B)
Performance Metrics: Confusion Matrix
• Accuracy results can mislead if you have imbalanced data
  – A normalised confusion matrix can reveal this
  – E.g., assume 90% of loans are good, and the classifier simply reports every loan as good. Overall accuracy = 9/10 = 90%, but it's useless.
  – True Positive Rate, TPR (fraction of positives identified as positive)
  – True Negative Rate, TNR (fraction of negatives identified as negative)
  – Old diagonal: (TP + TN)/N = 90%. New diagonal: (TPR + TNR)/2 = 50%

  Counts:                              Normalised (per actual column):
                 Actual G   Actual B                  Actual G    Actual B
  Predicted G       9          1        Predicted G   TPR = 1.0   FPR = 1.0
  Predicted B       0          0        Predicted B   FNR = 0     TNR = 0
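Normalising each actual-class column turns counts into rates; a sketch using the always-predict-good matrix above:

```python
import numpy as np

M = np.array([[9, 1],     # rows: predicted G/B, columns: actual G/B
              [0, 0]])
rates = M / M.sum(axis=0, keepdims=True)   # columns become [TPR, FNR] and [FPR, TNR]
print(rates)                               # [[1. 1.] [0. 0.]]
print(rates.diagonal().mean())             # (TPR + TNR) / 2 = 0.5, not 90%
```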
Performance Metrics: Confusion Matrix
• Different applications care about different parts of the confusion matrix.
  – E.g., a bank cares more about minimizing FPR (bad loans) than FNR (lost business)
Performance Metrics: Expected Value
• Which loan classifier is better, and by how much?
  – A makes more good loans, but B makes fewer bad loans
  – An Expected Value calculation gives a single number, given a confusion matrix and a cost matrix:
    EV = P(Outcome1) × Val(Outcome1) + P(Outcome2) × Val(Outcome2) + …

Cost matrix ($ per loan):
                 Actual G   Actual B
  Predicted G      +$2        −$4
  Predicted B     −$0.1      −$0.1

Classifier A:                           Classifier B:
                 Actual G   Actual B                   Actual G   Actual B
  Predicted G       3          4         Predicted G      2          0
  Predicted B       1          2         Predicted B      2          6

  EV_A = (2×3 − 4×4 − 0.1×1 − 0.1×2) / 10 = −$1.03 per customer
  EV_B = (2×2 − 4×0 − 0.1×2 − 0.1×6) / 10 = +$0.32 per customer
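The same expected-value computation in code, reproducing both numbers:

```python
import numpy as np

value = np.array([[ 2.0, -4.0],    # $ per loan: predicted G row (actual G, actual B)
                  [-0.1, -0.1]])   #             predicted B row

def expected_value(confusion, value):
    """EV per customer = sum over outcomes of P(outcome) * Val(outcome)."""
    confusion = np.asarray(confusion)
    return (confusion * value).sum() / confusion.sum()

A = [[3, 4], [1, 2]]
B = [[2, 0], [2, 6]]
print(expected_value(A, value))   # -1.03 $ per customer
print(expected_value(B, value))   # +0.32 $ per customer
```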
Performance Metrics: ROC
• Consider a variety of thresholds
– Each threshold defines a TPR and FPR
  – Sweeping the threshold from high to low traces out the ROC curve in (FPR, TPR) space
[Figure: distribution of P(Good | x) for bad loans vs good loans; three example thresholds and their confusion matrices, each giving one (FPR, TPR) point]
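A sketch of the threshold sweep: every threshold on the score P(Good | x) yields one (FPR, TPR) point, and together the points form the ROC curve. The scores and labels here are hypothetical:

```python
def roc_points(scores, y_true):
    """Return an (FPR, TPR) point for each threshold over the classifier scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for th in sorted(set(scores)) + [max(scores) + 1]:   # include "predict nobody" end
        pred = [1 if s >= th else 0 for s in scores]
        tp = sum(p and t for p, t in zip(pred, y_true))
        fp = sum(p and not t for p, t in zip(pred, y_true))
        points.append((fp / n_neg, tp / n_pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # hypothetical P(Good | x) values
y_true = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, y_true))         # from (1.0, 1.0) down to (0.0, 0.0)
```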
Summary