SE-6104 Data Mining and Analytics: Lecture # 13 Advanced Classification
Chapter 9
Advanced Classification
Techniques to Improve Classification Accuracy
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, and more robust
• Proven to improve accuracy for prediction
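As a concrete illustration of the training and voting steps above, here is a minimal sketch of bagging, assuming scikit-learn and NumPy are available and that class labels are encoded as non-negative integers; the decision-tree base learner, k = 10 rounds, and the synthetic data are illustrative choices, not part of the lecture.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def bagging_predict(X, y, X_test, k=10, seed=0):
        rng = np.random.default_rng(seed)
        d = len(X)
        votes = []
        for _ in range(k):
            idx = rng.integers(0, d, size=d)         # bootstrap: sample d tuples with replacement
            M_i = DecisionTreeClassifier().fit(X[idx], y[idx])
            votes.append(M_i.predict(X_test))        # each classifier M_i returns its prediction
        votes = np.array(votes)
        # M* assigns to each test tuple the class with the most votes
        return np.array([np.bincount(col).argmax() for col in votes.T])

    X, y = make_classification(n_samples=300, random_state=0)
    print(bagging_predict(X, y, X[:5]))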
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to allow
the subsequent classifier, Mi+1, to pay more attention to the
training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• The boosting algorithm can be extended for numeric prediction
• Compared with bagging: boosting tends to achieve greater accuracy,
but it also risks overfitting the model to misclassified data
AdaBoost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di of
the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di (e.g., using the Gini index)
• Its error rate is calculated using Di as a test set (i.e., the total error
associated with the classifier)
• If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. The error rate
of classifier Mi is the sum of the weights of the misclassified tuples:
$error(M_i) = \sum_{j} w_j \cdot err(X_j)$
• Amount of say: the weight of classifier Mi’s vote is
$\frac{1}{2} \log \frac{1 - error(M_i)}{error(M_i)}$
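Below is a hedged sketch of these AdaBoost update rules, assuming binary labels in {-1, +1} and scikit-learn decision stumps as base classifiers. For brevity it passes the tuple weights to the learner via sample_weight instead of resampling Di with replacement, a common equivalent; all names and the choice of k are illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, k=10):
        d = len(X)
        w = np.full(d, 1.0 / d)                # initially every tuple weight is 1/d
        models, alphas = [], []
        for _ in range(k):
            M_i = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            miss = M_i.predict(X) != y
            error = w[miss].sum()              # error(M_i): sum of weights of misclassified tuples
            if error == 0 or error >= 0.5:     # stop if the stump is perfect or no better than chance
                break
            alpha = 0.5 * np.log((1 - error) / error)    # amount of say of M_i
            w *= np.exp(np.where(miss, alpha, -alpha))   # raise weights of misses, lower the rest
            w /= w.sum()                                 # renormalize to a distribution
            models.append(M_i)
            alphas.append(alpha)
        return models, alphas

    def adaboost_predict(models, alphas, X):
        # M*: weighted vote, each classifier's vote weighted by its amount of say
        return np.sign(sum(a * M.predict(X) for a, M in zip(alphas, models)))

    X, y = make_classification(n_samples=300, random_state=0)
    y = 2 * y - 1                              # recode labels from {0, 1} to {-1, +1}
    models, alphas = adaboost_fit(X, y)
    print(adaboost_predict(models, alphas, X[:5]))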
Random Forest
• Random Forest:
( Breiman 2001)
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
• During classification, each tree votes and the most popular class is
returned
Method to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node.
• The CART methodology is used to grow the trees to maximum size
• Comparable in accuracy to Adaboost, but more robust to errors and
outliers
• Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
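A short sketch of the Forest-RI idea using scikit-learn's RandomForestClassifier, which randomly selects a subset of attributes as split candidates at each node; the parameter values and synthetic data below are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    rf = RandomForestClassifier(
        n_estimators=100,      # number of decision trees in the ensemble
        max_features="sqrt",   # F: attributes randomly considered as split candidates at each node
        random_state=0,
    )
    rf.fit(X, y)
    print(rf.predict(X[:5]))   # each tree votes; the most popular class is returned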
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for improving the classification accuracy of class-imbalanced
data in 2-class classification:
• Oversampling: re-sample data from the positive class
• Under-sampling: randomly eliminate tuples from negative class
• Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
• Ensemble techniques: Ensemble multiple classifiers introduced
above
• The class-imbalance problem remains difficult on multiclass tasks
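As a hedged illustration of threshold-moving for a 2-class imbalanced problem, the sketch below lowers the decision threshold t so that rare-class (positive) tuples are easier to classify; the classifier, the 95/5 class split, and t = 0.2 are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic imbalanced data: ~95% negative tuples, ~5% rare positive tuples
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    t = 0.2                                 # moved decision threshold (default would be 0.5)
    p_pos = clf.predict_proba(X)[:, 1]      # estimated probability of the rare positive class
    y_pred = (p_pos >= t).astype(int)       # predict positive more readily: fewer costly false negatives
    print("predicted positives:", y_pred.sum(), " actual positives:", y.sum())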
Classification by Backpropagation
(CH # 9.2)
• Brain
• A marvelous piece of architecture and design.
• In association with a nervous system, it controls the life patterns,
communications, interactions, growth and development of hundreds of
millions of life forms.
• There are about 10^10 to 10^14 nerve cells (called neurons) in an adult
human brain.
• Neurons are highly connected with each other. Each nerve cell is
connected to hundreds of thousands of other nerve cells.
• Passage of information between neurons is slow (in comparison to
transistors in an IC). It takes place in the form of electrochemical
signals between two neurons in milliseconds.
• Energy consumption per neuron is low (approximately 10^-6 watts).
[Figure: photomicrograph of biological neurons ("Look more like some spots of ink… aren't they!"), with the cell body, dendrites, synapse, and axons from other neurons labeled.]
[Figure: "Biological Neurons In Action" — network diagrams: a single-layer network with input units X1…Xn connected by weights w_ij to output units Y1…Ym, and a multilayer network with input units X1…Xn, hidden units Z1…Zp, output units Y1…Ym, weights w_ij (input to hidden) and v_jk (hidden to output), and bias units.]
Supervised Training
• Training is accomplished by presenting a sequence of
training vectors or patterns, each with an associated
target output vector.
• The weights are then adjusted according to a learning
algorithm.
• During training, the network develops an associative
memory. It can then recall a stored pattern when it is
given an input vector that is sufficiently similar to a vector
it has learned.
Unsupervised Training
• A sequence of input vectors is provided, but no target
vectors are specified in this case.
• The net modifies its weights and biases, so that the most
similar input vectors are assigned to the same output unit.
Classification by Backpropagation
• Backpropagation: A neural network learning algorithm
• Started by psychologists and neurobiologists to develop and test
computational analogues of neurons
• A neural network: A set of connected input/output units where
each connection has a weight associated with it
• During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of the
input tuples
• Also referred to as connectionist learning due to the connections
between units
Neural Network as a Classifier
• Weakness
• Long training time
• Require a number of parameters typically best determined empirically,
e.g., the network topology or "structure"
• Poor interpretability: Difficult to interpret the symbolic meaning behind
the learned weights and of "hidden units" in the network
• Strength
• High tolerance to noisy data
• Ability to classify untrained patterns
• Well-suited for continuous-valued inputs and outputs
• Successful on a wide array of real-world data
• Algorithms are inherently parallel
• Techniques have recently been developed for the extraction of rules from
trained neural networks
A Neuron (= a perceptron)
[Figure: inputs x0, x1, …, xn (input vector x) with weights w0, w1, …, wn (weight vector w) feed a weighted sum Σ, which is passed through an activation function f with bias -θ_j to produce the output y.]
• For example, with the sign activation function the output is
$y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i - \theta_j\right)$
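A minimal sketch of the perceptron output above: the weighted sum of the inputs minus the bias θ, passed through the sign activation; the numbers are made up for illustration.

    import numpy as np

    def perceptron_output(x, w, theta):
        # y = sign( sum_i w_i * x_i - theta )
        return np.sign(np.dot(w, x) - theta)

    x = np.array([1.0, 0.5, -0.3])   # input vector x
    w = np.array([0.2, -0.4, 0.7])   # weight vector w
    print(perceptron_output(x, w, theta=0.1))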
A Multi-Layer Feed-Forward Neural Network
• Given the net input Ij to unit j, the output Oj of unit j is computed as
$O_j = \frac{1}{1 + e^{-I_j}}$
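A one-unit illustration of the formula above; the net input is computed here as $I_j = \sum_i w_{ij} O_i + \theta_j$ (the standard weighted sum plus bias), and all values are made up.

    import numpy as np

    O_prev = np.array([1.0, 0.5, 0.2])    # outputs of the previous layer feeding unit j
    w_j = np.array([0.3, -0.2, 0.4])      # weights w_ij on the connections into unit j
    theta_j = 0.1                         # bias of unit j

    I_j = np.dot(w_j, O_prev) + theta_j   # net input to unit j
    O_j = 1.0 / (1.0 + np.exp(-I_j))      # logistic (sigmoid) output of unit j
    print(I_j, O_j)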
• Backpropagate the error: The error is propagated backward by updating the
weights and biases to reflect the error of the network’s prediction. For a unit j in
the output layer, the error Errj is computed by
$Err_j = O_j (1 - O_j)(T_j - O_j)$
where Oj is the actual output of unit j, and Tj is the known target value of the
given training tuple.
• The error of a hidden layer unit j is
$Err_j = O_j (1 - O_j) \sum_{k} Err_k \, w_{jk}$
where wjk is the weight of the connection from unit j to a unit k in the next
higher layer, and Errk is the error of unit k.
• Weights are updated by the following equations, where $\Delta w_{ij}$ is the change in
weight $w_{ij}$ and $l$ is the learning rate:
$\Delta w_{ij} = (l)\, Err_j \, O_i, \qquad w_{ij} = w_{ij} + \Delta w_{ij}$
• Biases are updated by the following equations, where $\Delta\theta_j$ is the change in
bias $\theta_j$:
$\Delta\theta_j = (l)\, Err_j, \qquad \theta_j = \theta_j + \Delta\theta_j$
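The sketch below applies the error and update equations above for one training tuple on a tiny 2-3-1 network with sigmoid units; the learning rate l = 0.9, the random initialization, and the input values are illustrative assumptions, not values from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

    x = np.array([0.6, 0.1]); target = np.array([1.0])      # one training tuple (inputs, class)
    W1 = rng.uniform(-0.5, 0.5, (2, 3)); b1 = np.zeros(3)   # input -> hidden weights and biases
    W2 = rng.uniform(-0.5, 0.5, (3, 1)); b2 = np.zeros(1)   # hidden -> output weights and biases
    l = 0.9                                                 # learning rate

    # Propagate the inputs forward: I_j = sum_i w_ij O_i + theta_j, then O_j = sigmoid(I_j)
    O_hid = sigmoid(x @ W1 + b1)
    O_out = sigmoid(O_hid @ W2 + b2)

    # Backpropagate the error
    Err_out = O_out * (1 - O_out) * (target - O_out)        # output-layer error Err_j
    Err_hid = O_hid * (1 - O_hid) * (W2 @ Err_out)          # hidden-layer error: sum_k Err_k w_jk

    # Update weights and biases: w_ij += (l) Err_j O_i, theta_j += (l) Err_j
    W2 += l * np.outer(O_hid, Err_out); b2 += l * Err_out
    W1 += l * np.outer(x, Err_hid);     b1 += l * Err_hid
    print(O_out)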
How a Multi-Layer Neural Network Works
• The inputs to the network correspond to the attributes measured for each
training tuple
• Inputs are fed simultaneously into the units making up the input layer
• They are then weighted and fed simultaneously to a hidden layer
• The number of hidden layers is arbitrary, although usually only one
• The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
• The network is feed-forward in that none of the weights cycles back to an
input unit or to an output unit of a previous layer
Defining a Network Topology
• First decide the network topology: # of units in the input layer, #
of hidden layers (if > 1), # of units in each hidden layer, and # of
units in the output layer
• Normalize the input values for each attribute measured in the training
tuples to [0.0–1.0]
• For a discrete-valued attribute, one input unit per domain value, each
initialized to 0 (see the sketch after this list)
• Output: for classification with more than two classes, one output unit
per class is used
• If a trained network's accuracy is unacceptable, repeat the training
process with a different network topology or a different set of initial
weights
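A hedged sketch of the input preparation described above, using pandas (an assumption; any tool works): the continuous attribute is min-max scaled to [0.0, 1.0], and the discrete attribute gets one 0/1 input unit per domain value; the column names and values are made up.

    import pandas as pd

    df = pd.DataFrame({"age": [25, 40, 58], "income": ["low", "high", "medium"]})

    # Normalize the continuous attribute to [0.0, 1.0]
    df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

    # One input unit per domain value of the discrete attribute, each 0 or 1
    df = pd.get_dummies(df, columns=["income"])
    print(df)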
Backpropagation
• Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
• For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
• Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
• Steps
• Initialize weights (to small random #s) and biases in the network
• Propagate the inputs forward (by applying activation function)
• Backpropagate the error (by updating weights and biases)
• Terminating condition (when error is very small, etc.)
Example to discuss
Terminating condition:
• Training stops when
• All weight changes $\Delta w_{ij}$ in the previous epoch are so small as to be below
some specified threshold, or
• The percentage of tuples misclassified in the previous epoch is
below some threshold, or
• A prespecified number of epochs has expired.
• In practice, several hundreds of thousands of epochs may be
required before the weights will converge.
Backpropagation and Interpretability
• Efficiency of backpropagation: Each epoch (one iteration through the training
set) takes O(|D| * w) time, with |D| tuples and w weights, but the number of
epochs can be exponential in n, the number of inputs, in the worst case
• Rule extraction from networks: network pruning
• Simplify the network structure by removing weighted links that have the
least effect on the trained network
• The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit layers
• Sensitivity analysis: measure the impact that a given input variable has on a
network output. The knowledge gained from this analysis can be represented
in rules