CSCI446/946 Big Data Analytics
Week 5 – Lecture: Classification
School of Computing and Information Technology
University of Wollongong Australia
Spring 2024
Content
• Brief Recap
– Clustering Analysis
• K-means, DBSCAN, SOM
• Classification
– Overview
– K-Nearest Neighbor (KNN)
– Multi-Layer Perceptron (MLP)
– Decision Tree (DT)
– Naïve Bayesian Classifier
• Diagnostics and Performance Indicators
K-means Clustering
K-means Clustering
• Application to image processing
K-means Clustering
• Application to image processing
[Figure: the original image segmented with K = 2, K = 3 and K = 10 clusters]
DBScan
• Given a density threshold (MinPts) and a radius (Eps), the points in a dataset are classified into three types: core point, border point, and noise point.
– Core points: points whose density >= MinPts
– Core points are in the interior of a density-based cluster.
– Example: if MinPts = 6, then A is a core point because its density = 7 (7 >= 6).
DBScan Example
[Figure: original points; with Eps = 10 and MinPts = 4, core, border and noise points are marked, then connected core points are marked]
Self-Organizing Maps
• Self-organizing maps have two layers:
– an input layer, and
– an output layer called the feature map.
• The feature map consists of neurons
– organized on a regular grid.
– Unlike other ANN types, the neurons in a SOM don’t have an activation function.
• Each neuron in a SOM is assigned a weight vector with the same dimensionality as the input space.
Self-Organizing Maps
• SOMs are an excellent choice for data visualization
– They act as a dimension-reduction technique
• Why use Self-Organizing Maps (SOMs) in BDA?
– Topology preservation (unlike PCA)
– Able to deal with new data & missing values (unlike t-SNE)
• When not to use SOMs in BDA:
– When the data is very sparse
– When cardinality (limited resolution) of the map is a
problem.
Content
• Brief Recap
– Clustering Analysis
• K-means, DBSCAN, SOM
• Classification
– Overview
– K-Nearest Neighbor (KNN)
– Multi-Layer Perceptron (MLP)
– Decision Tree (DT)
– Naïve Bayesian Classifier
• Diagnostics and Performance Indicators
Overview of Classification
• Classification is a fundamental learning
method that appears in applications related to
data mining
• The primary task performed by classifiers is to
assign class labels to new observations
– Sets: training, (validation), testing
• Classification methods are supervised
– Start with a training set of labelled observations
– Predict the outcome for new observations
Overview of Classification
• Examples of classifiers:
– K-nearest neighbour (KNN): model-free classifier
– Neural Networks (NN): massively parallel, non-linear parametric methods
– Decision Trees and Random Forests: make explainable if-then decisions
– Naïve Bayes (NB) classifier: probabilistic method
– Logistic Regression (LR): linear method
– Support Vector Machines (SVM): non-parametric classifiers
– …
Nearest-Neighbor Classifiers
• Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute its distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
[Figure: the neighborhood of a record x for (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification
• Compute the distance between two points:
– Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
• Determine the class from the nearest-neighbor list
– Take the majority vote of class labels among the k nearest neighbors
– Weigh the vote according to distance, e.g. with a weight factor w = 1/d²
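A minimal sketch of majority-vote KNN in R, using the knn function from the class package; the data frame dat, its feature columns x1 and x2, and its label column are hypothetical placeholders, not the lecture's dataset:

library(class)

set.seed(1)
idx      <- sample(nrow(dat), floor(0.8 * nrow(dat)))   # 80% training, 20% test split
features <- scale(dat[, c("x1", "x2")])                 # normalise the numeric inputs
pred <- knn(train = features[idx, ],
            test  = features[-idx, ],
            cl    = dat$label[idx],                     # class labels of the stored records
            k     = 5)                                  # number of nearest neighbours to retrieve
table(predicted = pred, actual = dat$label[-idx])       # simple confusion matrix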
Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
– Computational cost often increases when k increases
Neural Networks - MLP
[Figure: multi-layer perceptron architecture; bias inputs are not shown]
Neural Networks - MLP
• Weights are initially unknown
– Initialized with small values
– Are updated by a learning algorithm.
• NN can produce a non-linear mapping of the
input to the output.
– Coding of attribute values is non-critical as long as the inputs are numeric.
– Inputs for NN are often normalized. Why?
– A few exceptions exist, e.g. SOM.
Neural Networks - MLP
• Main challenges:
– Network design:
• How to organize the neurons?
• How many layers, how many neurons in each layer?
• Which activation function?
– Learning algorithm:
• How to update the weights?
• How to update the weights effectively?
Neural Networks - MLP
• It has been proven:
– Three layers are enough (if neurons are linear)
– Two layers are enough (if neurons are non-linear).
• Common activation functions:
Neural Networks - MLP
• Weight updates
– Compute the network error E = Σᵢ (oᵢ − tᵢ)², where oᵢ is the i-th network output, and tᵢ the desired network output (the target).
• When updating the weights, the aim is to minimize E for all inputs.
• Many algorithms are based on gradient descent methods.
• Update weights: Δwᵢⱼ = −α ∂E/∂wᵢⱼ, where α ∈ (0, 1) is a learning rate.
• Repeat for a number of epochs:
– Select a training sample
– Compute the output, then compute the error.
– Compute the gradient then update the weights.
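A minimal sketch of this loop in R for a single sigmoid neuron (a toy stand-in for the full MLP; the logical-AND data below is illustrative, not from the lecture):

sigmoid <- function(z) 1 / (1 + exp(-z))

X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)   # toy inputs
t <- c(0, 0, 0, 1)                                           # targets (logical AND)
w <- runif(2, -0.1, 0.1); b <- 0; alpha <- 0.5               # small random initial weights

for (epoch in 1:1000) {
  for (i in sample(nrow(X))) {             # select a training sample
    o    <- sigmoid(sum(w * X[i, ]) + b)   # compute the output
    grad <- 2 * (o - t[i]) * o * (1 - o)   # gradient of E = (o - t)^2 w.r.t. the net input
    w <- w - alpha * grad * X[i, ]         # weight update: -alpha * dE/dw
    b <- b - alpha * grad
  }
}
round(sigmoid(X %*% w + b))                # outputs should approximate the targets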
Neural Networks – MLP vs DNN
• Formally it has been proven:
– Three layers are enough (if neurons are linear)
– Two layers are enough (if neurons are non-linear)
• A surprise: Deep Neural Networks
– For complex problems it was found that deep NN are
much better.
• Many layers (possibly hundreds)
• CNN (1995)
• Breakthrough after 2000 (massive parallel GPUs)
Neural Networks in R
• Example: A training dataset
• Can a neural network predict placement given knowledge
score and communication score of a student?
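A minimal sketch of this step, assuming a data frame students with numeric columns knowledge and communication and a 0/1 column placed (hypothetical names standing in for the slide's dataset):

library(neuralnet)

set.seed(2)                                          # outputs depend on random weight initialization
nn <- neuralnet(placed ~ knowledge + communication,
                data = students,
                hidden = 3,                          # one hidden layer with 3 neurons
                linear.output = FALSE)               # sigmoid output, suited to 0/1 targets
plot(nn)                                             # draw the trained network and its weights
pred <- compute(nn, students[, c("knowledge", "communication")])$net.result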
Neural Networks in R
[R code and results shown on the slides]
Neural Networks in R
• Your results may vary. Why?
• We expected results such as 0s and 1s. What to do?
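Two brief answers, continuing the hypothetical sketch above: results vary between runs because the weights are initialized with small random values before training, and the network outputs are continuous values between 0 and 1 rather than exact 0s and 1s, so they are usually thresholded:

pred_class <- ifelse(pred > 0.5, 1, 0)   # cut the continuous outputs at 0.5 to obtain 0/1 labels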
Neural Networks in R
Reasons to choose
• Neural Nets are massively parallel systems
– Can be implemented efficiently on multi-core (e.g. GPU) systems.
– Trained models are computationally very efficient when processing new inputs.
• Neural Nets can solve a wide range of problems, and can classify samples into an arbitrary number of classes.
– NN perform better than humans on a growing number of tasks (e.g. playing chess, Go, lip reading, ...)
• Limited data pre-processing required.
• Insensitive to noise
• Often a tool of choice in Big Data Analytics.
Caution
• Most supervised Neural Networks are “black box” classifiers.
– They are unable to show or explain how a result came to be.
– e.g., what in the input caused the network to respond in a certain way?
• They have problems with unbalanced learning problems.
– i.e. when there are many more samples in one class than in
another class.
• The model is prone to overfit the training data when choosing too many neurons and/or layers.
– Performance may be sub-optimal when choosing too few neurons or layers.
– Finding the best number of neurons and layers is an art.
• Training can be time consuming.
– They tend to require a lot of training samples to perform well.
Decision Tree
• A decision tree uses a tree structure to specify
sequences of decisions and consequences
• Given input variable X = {x1,x2,…,xn}, the goal is
to predict an output variable Y
Decision Tree
• Each node tests a particular input variable
• Each branch represents the decision made
• Classifying a new observation is to traverse
this decision tree.
Decision Tree
• The depth of a node is the minimum number
of steps required to reach the node from root
• Leaf nodes are at the end of the last branches
on the tree, representing class labels
The General Algorithm of DT
• The objective of a decision tree algorithm
– Construct a tree T from a training set S
• The algorithm picks the most informative
attribute to branch the tree and does this
recursively for each of the sub-trees.
• The most informative attribute is identified by
– Information gain, calculated based on Entropy
The General Algorithm of DT
• Entropy
Question:
In a bank marketing dataset, there are 2000
customers in total. Among them, 1789
subscribed term deposit. What is the entropy of
the output variable “subscribed” (Hsubscribed)?
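The entropy formula itself appears as an image on the slide; in the usual notation it is H(Y) = −Σy P(y) log₂ P(y). A minimal R check for the question above, with 1789 of 2000 customers subscribed:

p <- 1789 / 2000                           # P(subscribed = yes)
H <- -p * log2(p) - (1 - p) * log2(1 - p)  # binary entropy of "subscribed"
H                                          # about 0.49 bits: the variable is highly skewed towards "yes"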
The General Algorithm of DT
• Conditional entropy
The General Algorithm of DT
• Information gain
• It compares
– The degree of purity of the parent node before a split
– The degree of purity of the child node after a split
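The formulas appear as images on the slides; in the usual notation, the conditional entropy is H(Y|X) = Σx P(X = x) · H(Y|X = x) and the information gain of splitting on X is Gain = H(Y) − H(Y|X). A minimal R sketch of this comparison (generic helpers, not the lecture's code):

entropy <- function(y) {                   # entropy of a vector of class labels
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

info_gain <- function(x, y) {              # x: categorical attribute, y: class labels
  h_parent <- entropy(y)                   # purity of the parent node before the split
  weights  <- table(x) / length(x)         # P(X = x) for each attribute value
  h_child  <- sum(weights * sapply(split(y, x), entropy))  # weighted child-node entropies
  h_parent - h_child                       # information gain of splitting on x
}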
The General Algorithm of DT
• The algorithm constructs sub-trees recursively
until one of the following criteria is met
– All the leaf nodes in the tree satisfy the minimum
purity threshold (i.e., are pure enough)
– No sufficient information gain can be obtained by splitting on more attributes (i.e., further splitting is not worthwhile)
– Any other stopping criterion is satisfied (such as
the maximum depth of the tree)
Decision Tree
• An example: A bank markets its term deposit product, so it needs to predict which clients would subscribe to a term deposit
– The bank collects a dataset of 2000 previous clients with known outcomes (“subscribed or not”).
– Input variables to describe each client are
• Job, marital status, education level, credit default,
housing loan, personal loan, contact type, previous
campaign contact
Decision Tree
[Table: the training dataset of the bank example]
Decision Tree
• From your point of view, what is the most important issue in building a decision tree?
[Figure: a decision tree built over the bank marketing training dataset]
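A minimal sketch of building such a tree with the rpart package; the data frame bank and the column names below are assumptions based on the input variables listed earlier, not the lecture's actual code:

library(rpart)
library(rpart.plot)

fit <- rpart(subscribed ~ job + marital + education + default +
               housing + loan + contact + poutcome,
             data = bank, method = "class",                           # classification tree
             control = rpart.control(minbucket = 100, maxdepth = 4))  # simple stopping criteria
rpart.plot(fit)                                                       # draw the tree
pred <- predict(fit, newdata = bank, type = "class")                  # class labels for new observations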
The General Algorithm of DT
• Assume the attribute X is “contact”
– Its value x takes one value in {cellular,
telephone, unknown}
• The outcome Y is “subscribed”
– Its value y takes one value in {no, yes}
The General Algorithm of DT
The General Algorithm of DT
• The algorithm splits on the attribute with the
largest information gain at each round
The General Algorithm of DT
• Information gain
• It compares
– The degree of purity of the parent node before a split
– The degree of purity of the child node after a split
Properties of Decision Tree
• Computationally inexpensive; classifying new observations is fast
• Classification rules are easy to understand
• Handles both numerical and categorical inputs
• Handles variables that have a nonlinear effect on the outcome better than linear models do
• Not a good choice if there are many irrelevant input variables
– Feature selection will be needed
Caution
• Decision trees use greedy algorithms
– It always chooses the option that seems the best
available at that moment
– However, the option may not be the best overall
and this could cause overfitting
– An ensemble technique can address this issue by
combining multiple decision trees that use
random splitting
Evaluating a Decision Tree
• Evaluate a decision tree
– Evaluate whether the splits of the tree make sense
and whether the decision rules are sound (say,
with domain experts)
– Having too many layers and obtaining nodes with
few members might be signs of overfitting
– Use standard diagnostic tools for classifiers
Naïve Bayes Classifier
• A probabilistic classification method based on
Bayes’ theorem
• A naïve Bayes classifier assumes that the
presence or absence of a particular feature of
a class is unrelated to the presence or absence
of other features (conditional independence assumption)
• Output includes a class label and its
corresponding probability score
Naïve Bayes Classifier
• Based on Bayes’ Theorem
[Portrait: Thomas Bayes, 1702–1761]
Bayes’ Theorem
• A more practical form of Bayes’ theorem
• Given A, how to calculate P(ci|A)?
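The formula is shown as an image on the slide; in standard notation, for a class label ci and an observed set of attribute values A = (a1, a2, …, am), Bayes’ theorem gives

P(ci | A) = P(A | ci) · P(ci) / P(A)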
Naïve Bayes Classifier
• With two simplifications, Bayes’ theorem
induces a Naïve Bayes classifier
• First, Conditional independence assumption
– Each attribute is conditionally independent of
every other attribute given a class label ci
– This simplifies the computation of P(A|ci)
Naïve Bayes Classifier
• Second, ignore the denominator P(A)
– Removing the denominator has no impact on the
relative probability scores
• In this way, the classifier becomes
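In standard notation (the slide's formula is shown as an image): conditional independence gives P(A | ci) = P(a1 | ci) · P(a2 | ci) · … · P(am | ci), and after dropping P(A) the classifier assigns the label

c* = argmax over ci of P(ci) · Πⱼ P(aj | ci)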
Caution
• An issue with rare events
– What if one of the attribute values does NOT appear in a class ci in the training dataset?
– P(aj|ci) for this attribute value will equal zero!
– P(ci|A) will then simply become zero!
• Smoothing technique
– It assigns a small nonzero probability to rare
events not included in a training dataset
Naïve Bayes Classifier
• Laplace smoothing (add-one smoothing)
– It pretends to see every outcome once more than
it actually appears
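In the usual add-one form (my notation, not copied from the slide):

P(aj | ci) = (count(aj, ci) + 1) / (count(ci) + number of distinct values of attribute j)

so an attribute value never seen with class ci still receives a small nonzero probability.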
Naïve Bayes Classifier
• Advantages
– Simple to implement, commonly used for text
classification
– Handle high-dimensional data efficiently
– Robust to overfitting with smoothing technique
• Disadvantages
– Sensitive to correlated variables (Why?)
– Not reliable for probability estimation
Naïve Bayes Classifier
• An example
– With the bank marketing dataset, use Naïve Bayes
Classifier to predict if a client would subscribe to a term
deposit
• Building a Naïve Bayes classifier requires calculating some statistics from the training dataset
– P(ci) for each class i = 1, 2, …, n
– P(aj|ci) for each attribute j = 1, 2, …, m in each class
Naïve Bayes Classifier
• P(ci) for each class
• P(aj|ci) for each attribute in each class
Naïve Bayes Classifier
• Testing a Naïve Bayes classifier on a new data
Naïve Bayes in R
• Two methods
– Build the classifier from scratch
– Call the naiveBayes function from the e1071 package
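A minimal sketch of the second method; the data frame bank and its class column subscribed are hypothetical placeholders for the bank marketing training data:

library(e1071)

nb <- naiveBayes(subscribed ~ ., data = bank, laplace = 1)   # laplace = 1 applies add-one smoothing
predict(nb, newdata = bank[1:5, ], type = "class")           # predicted class labels
predict(nb, newdata = bank[1:5, ], type = "raw")             # corresponding probability scores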
Naïve Bayes in R
Content
• Brief Recap
– Clustering Analysis
• K-means, DBSCAN, SOM
• Classification
– Overview
– K-Nearest Neighbor (KNN)
– Multi-Layer Perceptron (MLP)
– Decision Tree (DT)
– Naïve Bayesian Classifier
• Diagnostics and Performance Indicators
Diagnostics
• Holdout method
– The given data is randomly partitioned into two independent sets
• Training set (e.g., 80%) & Test set (e.g., 20%)
– Random sampling: a variation of holdout
• Repeat holdout k times, report average + std of accuracy
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets Di, each of approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples, for small-sized data
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
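A minimal sketch of k-fold cross-validation in base R; bank, train_model and accuracy are hypothetical placeholders for whatever dataset, classifier and metric are being evaluated:

k     <- 10
folds <- sample(rep(1:k, length.out = nrow(bank)))   # random fold assignment, roughly equal sizes
acc   <- numeric(k)
for (i in 1:k) {
  train <- bank[folds != i, ]                        # k-1 folds form the training set
  test  <- bank[folds == i, ]                        # fold i is held out as the test set
  model  <- train_model(train)                       # fit any classifier here
  acc[i] <- accuracy(model, test)                    # evaluate it on the held-out fold
}
mean(acc); sd(acc)                                   # average and std of accuracy across folds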
Performance Indicators
Confusion Matrix:

Actual class \ Predicted class | C1                  | ¬C1
C1                             | True Positives (TP) | False Negatives (FN)
¬C1                            | False Positives (FP)| True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000
Performance Indicators

Actual \ Predicted | C  | ¬C |
C                  | TP | FN | P
¬C                 | FP | TN | N
Total              | P' | N' | All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
– Accuracy = (TP + TN)/All
• Error rate: 1 − accuracy, or Error rate = (FP + FN)/All
• Class Imbalance Problem:
– One class may be rare, e.g. fraud or HIV-positive
– Significant majority of the negative class and minority of the positive class
– Sensitivity: True Positive recognition rate; Sensitivity = TP/P
– Specificity: True Negative recognition rate; Specificity = TN/N
Performance Indicators
• Precision: exactness – what % of the records the classifier labeled as positive are actually positive
• Recall: completeness – what % of the positives did the classifier label as positive? (equals sensitivity)
– A perfect score is 1.0
– In practice, there is an inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall (see below)
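The formula itself appears as an image on the slide; the standard harmonic mean is

F1 = 2 · Precision · Recall / (Precision + Recall)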
Performance Indicators
– Precision = 90/230 = 39.13%
– Recall = 90/300 = 30.00% = Sensitivity

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.50 (accuracy)
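As a check of the figures above, a minimal R computation from the confusion matrix entries:

TP <- 90; FN <- 210; FP <- 140; TN <- 9560
precision   <- TP / (TP + FP)                          # 90/230     = 39.13%
recall      <- TP / (TP + FN)                          # 90/300     = 30.00% (= sensitivity)
specificity <- TN / (TN + FP)                          # 9560/9700  = 98.56%
accuracy    <- (TP + TN) / (TP + TN + FP + FN)         # 9650/10000 = 96.50%
f1 <- 2 * precision * recall / (precision + recall)    # about 0.34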
Summary
• Supervised methods:
– Model observed data to predict future outcomes.
• Care must be taken in performing and
interpreting classification results
– How to determine the best input variables and their relationship to outcome variables.
– Understand and validate underlying assumptions.
– Transform variables when necessary.
– If in doubt, use a non-linear classification method
• Examples: Neural Nets, Naïve Bayes, SVM, …
Additional Classification Models
• Support Vector Machines
– Max-margin linear classifier, kernel trick.
• Supervised Neural Networks
– RNNs, Convolutional Networks, GNNs, …
• Bagging
– Bootstrap technique, ensemble method.
– N x weak learners -> vote on results (e.g. random forest)
• Boosting
– Weighted combination, ensemble method.
– N x weak learners in series, each tasked to improve on the
previous.
Q&A
Images courtesy of Google Images