CSCI446/946 Big Data Analytics

Week 5 – Lecture: Classification

School of Computing and Information Technology


University of Wollongong Australia
Spring 2024
Content

• Brief Recap
– Clustering Analysis
• K-means, DBSCAN, SOM
• Classification
– Overview
– K-Nearest Neighbor (KNN)
– Multi-Layer Perceptron (MLP)
– Decision Tree (DT)
– Naïve Bayesian Classifier
• Diagnostics and Performance Indicators
K-means Clustering
• Application to image processing
[Figure: original image and K-means segmentation results for K = 2, K = 3, and K = 10]
DBSCAN
• Given a density threshold (MinPts) and a radius (Eps),
the points in a dataset are classified into three types:
core point, border point, and noise point.
– Core points: points whose density >= MinPts
– Core points are in the interior of a density-based cluster.
– Example: if MinPts = 6, then A is a core point because its density = 7 (7 >= 6)
DBSCAN Example
• Original points, with Eps = 10 and MinPts = 4
[Figure: core, border and noise points are marked; connected core points are then joined into clusters]
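As a quick illustration (not part of the original slides), here is a minimal R sketch of DBSCAN with the same parameters, using the dbscan package on a hypothetical set of 2-D points:

```r
# install.packages("dbscan")
library(dbscan)

# x: a numeric matrix of 2-D points (hypothetical data)
set.seed(1)
x <- cbind(runif(200, 0, 100), runif(200, 0, 100))

# eps = radius, minPts = density threshold (as on the slide)
db <- dbscan(x, eps = 10, minPts = 4)

# cluster 0 denotes noise points; the other labels are density-based clusters
table(db$cluster)
```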
Self-Organizing Maps
• Self-organizing maps have two layers:
– An input layer, and
– An output layer called the feature map.
• The feature map consists of neurons
– organized on a regular grid.
– Unlike other ANN types, the neurons in a SOM don't have an activation function.
• Each neuron in a SOM is assigned a weight vector with the same dimensionality as the input space.
Self-Organizing Maps
• SOMs are an excellent choice for data visualization
– A dimension reduction technique
• Why use Self-Organizing Maps (SOMs) in BDA?
– Topology preservation (unlike PCA)
– Able to deal with new data & missing values (unlike t-SNE)
• When not to use SOMs in BDA:
– When the data is very sparse
– When the cardinality (limited resolution) of the map is a problem
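A minimal R sketch of training a SOM with the kohonen package; the scaled iris measurements below stand in for an arbitrary numeric feature matrix (an assumption, not the lecture's data):

```r
# install.packages("kohonen")
library(kohonen)

# X: numeric feature matrix (hypothetical data); SOM inputs are usually scaled
set.seed(1)
X <- scale(as.matrix(iris[, 1:4]))

# a 5 x 5 hexagonal feature map
som_grid  <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
som_model <- som(X, grid = som_grid, rlen = 200)

# visualise the trained map: codebook (weight) vectors and counts per neuron
plot(som_model, type = "codes")
plot(som_model, type = "counts")
```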
Content

• Brief Recap
– Clustering Analysis
• K-means, DBSCAN, SOM
• Classification
– Overview
– K-Nearest Neighbor (KNN)
– Multi-Layer Perceptron (MLP)
– Decision Tree (DT)
– Naïve Bayesian Classifier
• Diagnostics and Performance Indicators
Overview of Classification
• Classification is a fundamental learning method that appears in many data mining applications
• The primary task performed by classifiers is to
assign class labels to new observations
– Sets: training, (validation), testing
• Classification methods are supervised
– Start with a training set of labelled observations
– Predict the outcome for new observations
Overview of Classification
• Examples of classifiers:
– K-nearest neighbour (KNN): model-free classifier
– Neural Networks (NN): massively parallel nonlinear parametric methods
– Decision Trees and Random Forests: make explainable if-then decisions
– Naïve Bayes (NB) Classifier: probabilistic method
– Logistic Regression (LR): linear method
– Support Vector Machines (SVM): non-parametric classifiers
– …
Nearest-Neighbor Classifiers
[Figure: an unknown record among labelled training records]
• Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute its distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods of a record x]
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification
• Compute the distance between two points:
– Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
• Determine the class from the nearest neighbor list
– Take the majority vote of class labels among the k nearest neighbors
– Weigh the vote according to distance
• e.g. weight factor w = 1/d²
Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
– Computational cost often increases as k increases
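A minimal KNN sketch in R using the class package; the iris train/test split below is a hypothetical stand-in for the lecture's data:

```r
library(class)   # provides knn()

# hypothetical train/test split of the iris data
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4];  train_labels <- iris$Species[idx]
test  <- iris[-idx, 1:4]; test_labels  <- iris$Species[-idx]

# classify each test record by majority vote among its k = 3 nearest neighbors
pred <- knn(train = train, test = test, cl = train_labels, k = 3)

# simple accuracy estimate
mean(pred == test_labels)
```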
Neural Networks - MLP
[Figures: multi-layer perceptron architecture; bias inputs are not shown]
Neural Networks - MLP
• Weights are initially unknown
– Initialized with small values
– Updated by a learning algorithm
• A NN can produce a non-linear mapping of the input to the output.
– The coding of attribute values is non-critical as long as the inputs are numeric.
– Inputs for a NN are often normalized. Why?
– Few exceptions: e.g. SOM
Neural Networks - MLP
• Main challenges:
– Network design:
• How to organize the neurons?
• How many layers, how many neurons in each layer?
• Which activation function?
– Learning algorithm:
• How to update the weights?
• How to update the weights effectively?
Neural Networks - MLP
• It has been proven:
– Three layers are enough (if neurons are linear)
– Two layers are enough (if neurons are non-linear).
• Common activation functions: sigmoid (logistic), tanh, ReLU
Neural Networks - MLP
• Weight updates
– Compute the network error E = Σᵢ (oᵢ − tᵢ)², where oᵢ is the i-th network output and tᵢ the desired network output (the target).
• When updating the weights, the aim is to minimize E for all inputs.
• Many algorithms are based on gradient descent methods.
• Update weights: Δwᵢⱼ = −α ∂E/∂wᵢⱼ, where α ∈ (0,1) is a learning rate.
• Repeat for a number of epochs:
– Select a training sample
– Compute the output, then compute the error
– Compute the gradient, then update the weights
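As a sketch of this update rule (not lecture code), the following R snippet trains a single sigmoid neuron by gradient descent on a toy dataset; the data, learning rate, and epoch count are assumptions:

```r
# toy data (an assumption): a bias input plus two inputs, binary targets
set.seed(1)
X <- cbind(1, matrix(runif(40), ncol = 2))   # 20 samples: bias + 2 inputs
y <- as.numeric(X[, 2] + X[, 3] > 1)         # targets t_i

sigmoid <- function(z) 1 / (1 + exp(-z))

w     <- runif(3, -0.1, 0.1)   # weights initialized with small values
alpha <- 0.5                   # learning rate in (0, 1)

for (epoch in 1:1000) {
  o    <- sigmoid(X %*% w)                            # network outputs o_i
  # gradient of E = sum((o_i - t_i)^2) w.r.t. the weights (averaged over samples)
  grad <- t(X) %*% (2 * (o - y) * o * (1 - o)) / nrow(X)
  w    <- w - alpha * grad                            # delta w = -alpha * dE/dw
}

mean((sigmoid(X %*% w) > 0.5) == y)                   # training accuracy
```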
Neural Networks – MLP vs DNN
• Formally it has been proven:
– Three layers are enough (if neurons are linear)
– Two layers are enough (if neurons are non-linear)
• A surprise: Deep Neural Networks
– For complex problems it was found that deep NNs are much better.
• Many layers (possibly hundreds)
• CNN (1995)
• Breakthrough after 2000 (massively parallel GPUs)
Neural Networks in R
• Example: a training dataset of students, with a knowledge score, a communication score, and a placement outcome for each student
• Can a neural network predict placement given the knowledge score and communication score of a student? (See the sketch below.)
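A minimal sketch of such a model with the neuralnet package; the data frame and its column names (knowledge, communication, placed) are hypothetical stand-ins for the dataset shown in the lecture:

```r
# install.packages("neuralnet")
library(neuralnet)

# hypothetical training data: scores in [0, 100], placed in {0, 1}
set.seed(1)
train <- data.frame(knowledge     = runif(50, 0, 100),
                    communication = runif(50, 0, 100))
train$placed <- as.numeric(train$knowledge + train$communication > 110)

# scale the inputs to [0, 1]; NN inputs are usually normalized
train$knowledge     <- train$knowledge / 100
train$communication <- train$communication / 100

# one hidden layer with 3 neurons; logistic output for a 0/1 target
nn <- neuralnet(placed ~ knowledge + communication,
                data = train, hidden = 3, linear.output = FALSE)

# network outputs are values in (0, 1); threshold them to obtain 0s and 1s
out  <- compute(nn, train[, c("knowledge", "communication")])$net.result
pred <- ifelse(out > 0.5, 1, 0)
mean(pred == train$placed)
```

Because the weights are initialized randomly, re-running this without a fixed seed can give slightly different outputs, and the raw outputs are values between 0 and 1 that need to be thresholded, which ties in with the questions on the next slide.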
Neural Networks in R
• Your results may vary. Why?
• We expected results such as 0s and 1s. What to do?
Reason to choose
• Neural Nets are massively parallel systems
– Can be implemented efficiently on multi-core (e.g. GPU) systems.
– Trained models are computationally very efficient when processing new inputs.
• Neural Nets can solve a wide range of problems and can classify samples into an arbitrary number of classes.
– NNs perform better than humans on a growing number of tasks (e.g. playing chess, Go, lip reading, ...)
• Limited data pre-processing required.
• Insensitive to noise.
• Often a tool of choice in Big Data Analytics.
Caution
• Most supervised Neural Networks are "black box" classifiers.
– They are unable to show or explain how a result came to be, i.e. what in the input caused the network to respond in a certain way.
• They have problems with unbalanced learning problems, i.e. when there are many more samples in one class than in another class.
• The model is prone to overfit the training data when choosing too many neurons and/or layers.
– Performance may be sub-optimal when choosing too few neurons or layers.
– Finding the best number of neurons and layers is an art.
• Training can be time consuming.
– NNs tend to require a lot of training samples to perform well.
Decision Tree
• A decision tree uses a tree structure to specify
sequences of decisions and consequences
• Given input variables X = {x1, x2, …, xn}, the goal is to predict an output variable Y
Decision Tree
• Each node tests a particular input variable
• Each branch represents the decision made
• Classifying a new observation amounts to traversing this decision tree.
Decision Tree
• The depth of a node is the minimum number
of steps required to reach the node from root
• Leaf nodes are at the end of the last branches
on the tree, representing class labels
The General Algorithm of DT
• The objective of a decision tree algorithm
– Construct a tree T from a training set S
• The algorithm picks the most informative
attribute to branch the tree and does this
recursively for each of the sub-trees.
• The most informative attribute is identified by
– Information gain, calculated based on Entropy
The General Algorithm of DT
• Entropy
– H(Y) = − Σ_y P(Y = y) log2 P(Y = y)

Question:
In a bank marketing dataset, there are 2000 customers in total. Among them, 1789 subscribed to a term deposit. What is the entropy of the output variable "subscribed" (H_subscribed)?
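One way to check the answer in R (a small helper, not from the lecture):

```r
# entropy of a discrete variable, given the class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

# 1789 of 2000 customers subscribed, 211 did not
entropy(c(1789, 211))   # ~0.486 bits
```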
The General Algorithm of DT
• Conditional entropy
– H(Y|X) = Σ_x P(X = x) H(Y | X = x) = − Σ_x P(X = x) Σ_y P(Y = y | X = x) log2 P(Y = y | X = x)
The General Algorithm of DT
• Information gain
– InfoGain(X) = H(Y) − H(Y|X)
• It compares
– The degree of purity of the parent node before a split
– The degree of purity of the child nodes after a split
The General Algorithm of DT
• The algorithm constructs sub-trees recursively until one of the following criteria is met:
– All the leaf nodes in the tree satisfy the minimum purity threshold (i.e., they are pure enough)
– There is not sufficient information gain from splitting on more attributes (i.e., further splitting is not worthwhile)
– Any other stopping criterion is satisfied (such as the maximum depth of the tree)
Decision Tree
• An example: a bank markets its term deposit product, so the bank needs to predict which clients would subscribe to a term deposit
– The bank collects a dataset of 2000 previous clients with known outcomes ("subscribed or not").
– The input variables describing each client are
• Job, marital status, education level, credit default, housing loan, personal loan, contact type, previous campaign contact
Decision Tree
[Table: the training dataset of the bank example]
Decision Tree
• From your point of view, what is the most important issue in building a decision tree?
[Figure: a decision tree built over the bank marketing training dataset; an R sketch of fitting such a tree follows]
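A minimal sketch of fitting such a tree in R with the rpart package; the data frame bank and its column names (job, marital, education, default, housing, loan, contact, poutcome, subscribed) are hypothetical stand-ins for the lecture's dataset:

```r
# install.packages(c("rpart", "rpart.plot"))
library(rpart)
library(rpart.plot)

# bank: a data frame of 2000 clients with a factor column 'subscribed'
fit <- rpart(subscribed ~ job + marital + education + default +
                          housing + loan + contact + poutcome,
             data    = bank,
             method  = "class",                      # classification tree
             parms   = list(split = "information"),  # entropy-based splits
             control = rpart.control(maxdepth = 4, minbucket = 50))

rpart.plot(fit)                                      # visualise the tree
pred <- predict(fit, newdata = bank, type = "class")
table(actual = bank$subscribed, predicted = pred)    # confusion matrix
```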


The General Algorithm of DT
• Assume the attribute X is “contact”
– Its value x takes one value in {cellular,
telephone, unknown}
• The outcome Y is “subscribed”
– Its value y takes one value in {no, yes}
The General Algorithm of DT
[Worked example: conditional entropy and information gain computed for the "contact" attribute]
The General Algorithm of DT
• The algorithm splits on the attribute with the largest information gain at each round
Properties of Decision Tree
• Computationally inexpensive; new observations are easy to classify
• Classification rules can be understood
• Handles both numerical and categorical input
• Handles variables that have a nonlinear effect on the outcome better than linear models
• Not a good choice if there are many irrelevant input variables
– Feature selection will be needed
Caution
• Decision trees use greedy algorithms
– They always choose the option that seems the best available at that moment
– However, that option may not be the best overall, and this can cause overfitting
– An ensemble technique can address this issue by combining multiple decision trees that use random splitting (see the sketch below)
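A minimal sketch of such an ensemble in R with the randomForest package, reusing the hypothetical bank data frame from the decision-tree sketch above:

```r
# install.packages("randomForest")
library(randomForest)

# an ensemble of 500 trees, each grown on a bootstrap sample with a random
# subset of attributes considered at every split (subscribed must be a factor)
rf <- randomForest(subscribed ~ ., data = bank, ntree = 500)

print(rf)                            # out-of-bag error estimate
pred <- predict(rf, newdata = bank)  # majority vote of the trees
```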
Evaluating a Decision Tree
• Evaluate whether the splits of the tree make sense and whether the decision rules are sound (say, with domain experts)
• Having too many layers and obtaining nodes with few members might be signs of overfitting
• Use standard diagnostic tools for classifiers
Naïve Bayes Classifier
• A probabilistic classification method based on
Bayes’ theorem
• A naïve Bayes classifier assumes that the
presence or absence of a particular feature of
a class is unrelated to the presence or absence
of other features (conditional independence assumption)
• Output includes a class label and its
corresponding probability score
Naïve Bayes Classifier
• Based on Bayes' Theorem (Thomas Bayes, 1702–1761)

Bayes' Theorem
• Bayes' theorem: P(C|A) = P(A|C) P(C) / P(A)
• A more practical form, for a class label ci and an observed set of attribute values A = {a1, a2, …, am}:
– P(ci|A) = P(A|ci) P(ci) / P(A)
• Given A, how to calculate P(ci|A)?
Naïve Bayes Classifier
• With two simplifications, Bayes' theorem induces a Naïve Bayes classifier
• First, the conditional independence assumption
– Each attribute is conditionally independent of every other attribute given a class label ci:
P(A|ci) = P(a1|ci) × P(a2|ci) × … × P(am|ci) = Π_j P(aj|ci)
– This simplifies the computation of P(A|ci)
Naïve Bayes Classifier
• Second, ignore the denominator P(A)
– Removing the denominator has no impact on the relative probability scores
• In this way, the classifier becomes: assign the class ci that maximizes P(ci) × Π_j P(aj|ci)
Caution
• An issue with rare events
– What if one of the attribute values does NOT appear with a class ci in the training dataset?
– P(aj|ci) for this attribute value will equal zero!
– P(ci|A) will then simply become zero!
• Smoothing technique
– It assigns a small nonzero probability to rare events not included in the training dataset

Naïve Bayes Classifier
• Laplace smoothing (add-one smoothing)
– It pretends to see every outcome once more than it actually appears
Naïve Bayes Classifier
• Advantages
– Simple to implement; commonly used for text classification
– Handles high-dimensional data efficiently
– Robust to overfitting when a smoothing technique is used
• Disadvantages
– Sensitive to correlated variables (Why?)
– Not reliable for probability estimation
Naïve Bayes Classifier
• An example
– With the bank marketing dataset, use a Naïve Bayes classifier to predict if a client would subscribe to a term deposit
• Building a Naïve Bayes classifier requires calculating some statistics from the training dataset:
– P(ci) for each class i = 1, 2, …, n
– P(aj|ci) for each attribute j = 1, 2, …, m in each class

Naïve Bayes Classifier
[Tables: P(ci) for each class, and P(aj|ci) for each attribute in each class, estimated from the bank marketing training dataset]
Naïve Bayes Classifier
• Testing a Naïve Bayes classifier on new data
Naïve Bayes in R
• Two methods
– Build the classifier from scratch
– Call the naiveBayes function from the e1071 package (see the sketch below)
Naïve Bayes in R
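A minimal sketch of the second method, again assuming a hypothetical bank data frame with a factor column subscribed:

```r
# install.packages("e1071")
library(e1071)

# laplace = 1 applies add-one (Laplace) smoothing to rare attribute values
nb <- naiveBayes(subscribed ~ ., data = bank, laplace = 1)

# class labels for new clients, and the corresponding probability scores
pred  <- predict(nb, newdata = bank)
probs <- predict(nb, newdata = bank, type = "raw")

table(actual = bank$subscribed, predicted = pred)
head(probs)
```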
Content

• Brief Recap
– Clustering Analysis
• K-means, DBSCAN, SOM
• Classification
– Overview
– K-Nearest Neighbor (KNN)
– Multi-Layer Perceptron (MLP)
– Decision Tree (DT)
– Naïve Bayesian Classifier
• Diagnostics and Performance Indicators
Diagnostics
• Holdout method
– The given data is randomly partitioned into two independent sets
• Training set (e.g., 80%) & test set (e.g., 20%)
– Random sampling: a variation of holdout
• Repeat holdout k times; report average + std of accuracy
• Cross-validation (k-fold, where k = 10 is most popular; see the sketch after this list)
– Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = number of tuples, for small-sized data
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
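A minimal sketch of k-fold cross-validation in R, wrapped around the Naïve Bayes classifier above; the fold assignment and the hypothetical bank data frame are assumptions:

```r
library(e1071)

k <- 10
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(bank)))  # random fold labels

acc <- numeric(k)
for (i in 1:k) {
  test_idx <- which(folds == i)
  train    <- bank[-test_idx, ]
  test     <- bank[test_idx, ]

  model  <- naiveBayes(subscribed ~ ., data = train)
  pred   <- predict(model, newdata = test)
  acc[i] <- mean(pred == test$subscribed)
}

mean(acc); sd(acc)   # average accuracy and its variability across folds
```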
Performance Indicators
• Confusion matrix:

Actual class \ Predicted class   C1                     ¬ C1
C1                               True Positives (TP)    False Negatives (FN)
¬ C1                             False Positives (FP)   True Negatives (TN)

• Example of a confusion matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
Performance Indicators
• Notation:

A \ P   C     ¬C
C       TP    FN    P
¬C      FP    TN    N
        P'    N'    All

• Class Imbalance Problem:
– One class may be rare, e.g. fraud or HIV-positive
– Significant majority of the negative class and minority of the positive class
• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
– Accuracy = (TP + TN) / All
• Error rate: 1 − accuracy, or
– Error rate = (FP + FN) / All
• Sensitivity: true positive recognition rate
– Sensitivity = TP / P
• Specificity: true negative recognition rate
– Specificity = TN / N
Performance Indicators
• Precision: exactness – what % of the records that the classifier labeled as positive are actually positive?
– Precision = TP / (TP + FP)
• Recall: completeness – what % of the positives did the classifier label as positive? (equal to sensitivity)
– Recall = TP / (TP + FN)
– A perfect score is 1.0
– In practice, there is an inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall
– F1 = 2 × Precision × Recall / (Precision + Recall)
Performance Indicators
• Example (cancer screening):
– Precision = 90/230 = 39.13%
– Recall = 90/300 = 30.00% = Sensitivity

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)
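These indicators follow directly from the confusion matrix; a short R check of the numbers above (a sketch, not lecture code):

```r
# confusion matrix counts from the cancer example
TP <- 90; FN <- 210; FP <- 140; TN <- 9560

accuracy    <- (TP + TN) / (TP + FN + FP + TN)          # 0.9650
precision   <- TP / (TP + FP)                           # 0.3913
recall      <- TP / (TP + FN)                           # 0.3000 (= sensitivity)
specificity <- TN / (TN + FP)                           # 0.9856
f1 <- 2 * precision * recall / (precision + recall)     # ~0.34

round(c(accuracy = accuracy, precision = precision, recall = recall,
        specificity = specificity, F1 = f1), 4)
```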
Summary
• Supervised methods:
– Model observed data to predict future outcomes.
• Care must be taken in performing and interpreting classification results
– How to determine the best input variables and their relationship to the outcome variables.
– Understand and validate the underlying assumptions.
– Transform variables when necessary.
– If in doubt, use a non-linear classification method
• Examples: Neural Nets, Naïve Bayes, SVM, …
Additional Classification Models
• Support Vector Machines
– Max-margin linear classifier, kernel trick.
• Supervised Neural Networks
– RNNs, Convolutional Networks, GNNs, …
• Bagging
– Bootstrap technique, ensemble method.
– N × weak learners vote on the results (e.g. random forest)
• Boosting
– Weighted combination, ensemble method.
– N × weak learners in series, each tasked to improve on the previous.
Q&A

Images courtesy of Google Images
