Data Warehouse and Mining – Unit 3
› Basic issues regarding classification and prediction
› Classification by Decision Tree,
› Bayesian classification,
› Classification by back propagation,
› Associative classification,
› Prediction:
› Statistical-Based Algorithms,
› Decision Tree-Based Algorithms,
› Neural Network-Based Algorithms,
› Rule-Based Algorithms,
› Other Classification Methods,
› Combining Techniques,
› Classifier Accuracy and Error Measures
Rule-based Classification
› Rules are a good way of representing information or bits of
knowledge. A rule-based classifier uses a set of IF-THEN rules for
classification. An IF-THEN rule is an expression of the form
› IF condition THEN conclusion.
› An example is rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes.
› If the condition (i.e., all the attribute tests) in a rule antecedent
holds true for a given tuple, we say that the rule antecedent is
satisfied (or simply, that the rule is satisfied) and that the rule
covers the tuple.
› If a rule is satisfied by a tuple X, the rule is said to be triggered.
› A rule R can be assessed by its coverage and accuracy.
Given a class-labeled data set, D, let n_covers be the number
of tuples covered by R, n_correct be the number of tuples
covered by R that it correctly classifies, and |D| be the
number of tuples in D. We can define the coverage and
accuracy of R as

coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
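› A minimal sketch in Python of these two measures (the tuple and rule representations, and all helper names, are illustrative assumptions, not from the slides):

```python
# Sketch: coverage and accuracy of an IF-THEN rule over a class-labeled
# data set D. Tuples are dicts; a rule is (antecedent dict, predicted class).

def covers(antecedent, t):
    """True if every attribute test in the antecedent holds for tuple t."""
    return all(t.get(attr) == value for attr, value in antecedent.items())

def coverage_and_accuracy(antecedent, predicted_class, D, label="class"):
    covered = [t for t in D if covers(antecedent, t)]
    n_covers = len(covered)
    n_correct = sum(1 for t in covered if t[label] == predicted_class)
    coverage = n_covers / len(D)                          # n_covers / |D|
    accuracy = n_correct / n_covers if n_covers else 0.0  # n_correct / n_covers
    return coverage, accuracy

# Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
R1 = ({"age": "youth", "student": "yes"}, "yes")
```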
Rule-based Classification
› Example:
› Rule accuracy and coverage. Consider rule R1, which covers 2 of the 14 tuples.
› It can correctly classify both tuples. Therefore,
› coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.
› That is, a rule’s coverage is the percentage of tuples that are covered by the rule
(i.e., their attribute values hold true for the rule’s antecedent). For a rule’s
accuracy, we look at the tuples that it covers and see what percentage of them
the rule can correctly classify.
› X=(age = youth, income = medium, student = yes, credit rating
= fair).
› We would like to classify X according to buys_computer. X
satisfies R1, which triggers the rule.
› If R1 is the only rule satisfied, then the rule fires by returning
the class prediction for X.
› Note that triggering does not always mean firing because there
may be more than one rule that is satisfied! If more than one
rule is triggered, we have a potential problem.
› What if they each specify a different class? Or what if no rule is
satisfied by X?
Conflict resolution strategies for selecting the rule that fires
› If more than one rule is triggered, we need a conflict resolution
strategy to figure out which rule gets to fire and assign its class
prediction to X.
› There are many possible strategies. We look at two, namely size
ordering and rule ordering.
› The size ordering scheme assigns the highest priority to the
triggering rule that has the “toughest” requirements, where
toughness is measured by the rule antecedent size. That is, the
triggering rule with the most attribute tests is fired.
› Rule ordering comes in two flavors: class-based ordering and rule-based ordering.
› With class-based ordering, the classes are sorted in order of decreasing
“importance” such as by decreasing order of prevalence. That is, all the rules for
the most prevalent (or most frequent) class come first, the rules for the next
prevalent class come next, and so on. Alternatively, they may be sorted based
on the misclassification cost per class. Within each class, the rules are not
ordered—they don’t have to be because they all predict the same class (and so
there can be no class conflict).
› With rule-based ordering, the rules are organized into one long priority list,
according to some measure of rule quality, such as accuracy, coverage, or size
(number of attribute tests in the rule antecedent), or based on advice from
domain experts. When rule ordering is used, the rule set is known as a decision
list. With rule ordering, the triggering rule that appears earliest in the list has the
highest priority, and so it gets to fire its class prediction. Any other rule that
satisfies X is ignored. Most rule-based classification systems use a class-based
rule-ordering strategy.
› What if no rule is satisfied by X? Fall back to a default rule, which fires only when no other rule covers the tuple.
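› A minimal sketch of size ordering with a default-rule fallback (the rule format and names are assumptions for illustration):

```python
# Sketch: size ordering with a default rule. A rule is
# (antecedent dict, predicted class).

def classify(rules, X, default_class):
    # Rules whose antecedent is satisfied by X are "triggered".
    triggered = [(ante, cls) for ante, cls in rules
                 if all(X.get(a) == v for a, v in ante.items())]
    if not triggered:
        return default_class              # no rule is satisfied by X
    # Size ordering: the triggering rule with the most attribute tests fires.
    toughest, cls = max(triggered, key=lambda r: len(r[0]))
    return cls
```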
Forming Rules from Decision Trees
Sample rule (one root-to-leaf path):
IF age = youth AND student = no
THEN buys_computer = no
› A disjunction (logical OR) is implied between each of the extracted
rules. Because the rules are extracted directly from the tree, they
are mutually exclusive and exhaustive.
› Mutually exclusive means that we cannot have rule conflicts here
because no two rules will be triggered for the same tuple. (We have
one rule per leaf, and any tuple can map to only one leaf.)
› Exhaustive means there is one rule for each possible attribute–value
combination, so that this set of rules does not require a default
rule. Therefore, the order of the rules does not matter—they are
unordered.
› So, for the rules extracted from the previous tree: are they mutually
exclusive, and are they exhaustive? (A sketch of the extraction follows.)
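› A minimal sketch of reading rules off a tree, one rule per root-to-leaf path (the node encoding is hypothetical; the tree mirrors the buys_computer example):

```python
# Sketch: extracting one IF-THEN rule per leaf by walking every
# root-to-leaf path. Encoding: a leaf is a class-label string; an
# internal node is (attribute, {branch_value: subtree}).

def extract_rules(node, path=()):
    if isinstance(node, str):                  # leaf: emit one rule
        return [(dict(path), node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, path + ((attribute, value),))
    return rules

tree = ("age", {
    "youth": ("student", {"no": "buys_computer = no",
                          "yes": "buys_computer = yes"}),
    "middle_aged": "buys_computer = yes",
    "senior": ("credit_rating", {"fair": "buys_computer = yes",
                                 "excellent": "buys_computer = no"}),
})
for antecedent, conclusion in extract_rules(tree):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in antecedent.items())
          + " THEN " + conclusion)
```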
Classification by Backpropagation
› Backpropagation is a neural network learning algorithm
› Multilayer Feed-forward NN: A multilayer feed-forward neural network consists of
an input layer, one or more hidden layers, and an output layer.
• Input units
• Input layer
• Weights
• Neurodes
• Hidden layer(s)
• Output layer
• Fully connected NN
Defining a network topology
› Before training can begin, the user must decide on the
network topology by specifying the number of units in the
input layer, the number of hidden layers (if more than
one), the number of units in each hidden layer, and the
number of units in the output layer.
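› One possible way to record such a topology, with one weight per connection and one bias per non-input unit (the sizes and names here are illustrative assumptions):

```python
import random

# Sketch: fixing a topology before training. layer_sizes[0] is the number
# of input units, middle entries are hidden-layer sizes, and the last
# entry is the number of output units.
layer_sizes = [3, 2, 1]   # 3 inputs, one hidden layer of 2 units, 1 output

# Weights between consecutive layers and biases for non-input units,
# initialized to small random numbers (see the next slide).
weights = [[[random.uniform(-0.5, 0.5) for _ in range(layer_sizes[l])]
            for _ in range(layer_sizes[l + 1])]
           for l in range(len(layer_sizes) - 1)]
biases = [[random.uniform(-0.5, 0.5) for _ in range(n)]
          for n in layer_sizes[1:]]
```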
Backpropagation
› Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network’s prediction for each tuple with the actual known target
value. The target value may be the known class label of the training tuple (for
classification problems) or a continuous value (for numeric prediction).
› For each training tuple, the weights are modified so as to minimize the mean-
squared error between the network’s prediction and the actual target value.
› These modifications are made in the “backwards” direction (i.e., from the output
layer) through each hidden layer down to the first hidden layer (hence the name
backpropagation).
› Although it is not guaranteed, in general the weights will eventually converge,
and the learning process stops.
Backpropagation algorithm
› Initialize the weights: The weights in the network are initialized to small random
numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias
associated with it. The biases are similarly initialized to small random numbers.
› Each training tuple, X, is processed by the following steps:
› Propagate the inputs forward: First, the training tuple is fed to the network’s input
layer. The inputs pass through the input units, unchanged. That is, for an input
unit, j, its output, Oj , is equal to its input value, Ij .
› Next, the net input and output of each unit in the hidden and output layers are
computed. The net input to a unit in the hidden or output layers is computed as a
linear combination of its inputs.
› Each such unit has a number of inputs to it that are, in fact, the outputs of the
units connected to it in the previous layer. Each connection has a weight. To
compute the net input to the unit, each input connected to the unit is multiplied by
its corresponding weight, and this is summed.
Backpropagation algorithm
› Given a unit, j, in a hidden or output layer, the net input, I_j, to unit j is

I_j = Σ_i w_ij O_i + θ_j

› where w_ij is the weight of the connection from unit i in the previous layer to unit j; O_i is the output of
unit i from the previous layer; and θ_j is the bias of the unit. The bias acts as a threshold in that it
serves to vary the activity of the unit.
› Each unit in the hidden and output layers takes its net input and then applies an activation function
to it. The function symbolizes the activation of the neuron represented by the unit. The logistic, or
sigmoid, function is used.
› Given the net input I_j to unit j, then O_j, the output of unit j, is computed as

O_j = 1 / (1 + e^(-I_j))
› This function is also referred to as a squashing function, because it maps a large input domain onto
the smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the
backpropagation algorithm to model classification problems that are linearly inseparable.
› We compute the output values, Oj , for each hidden layer, up to and including the output layer, which
gives the network’s prediction.
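› A minimal sketch of this forward pass, reusing the weights/biases layout from the earlier topology sketch (the function names are assumptions):

```python
import math

def sigmoid(I):
    """Logistic (squashing) function: maps net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-I))

def feed_forward(x, weights, biases):
    """Propagate the inputs forward; returns the output list O of every
    layer (the last list is the network's prediction)."""
    outputs = [list(x)]                  # input units pass values unchanged
    for W, b in zip(weights, biases):
        layer = []
        for j in range(len(W)):
            # Net input: I_j = sum_i w_ij * O_i + theta_j
            I_j = sum(w * o for w, o in zip(W[j], outputs[-1])) + b[j]
            layer.append(sigmoid(I_j))   # O_j = 1 / (1 + e^(-I_j))
        outputs.append(layer)
    return outputs
```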
› Backpropagate the error: The error is propagated backward by updating the
weights and biases to reflect the error of the network’s prediction. For a unit j in
the output layer, the error Err_j is computed by

Err_j = O_j (1 - O_j)(T_j - O_j)

› where O_j is the actual output of unit j, and T_j is the known target value of the
given training tuple. Note that O_j (1 - O_j) is the derivative of the logistic function.
› To compute the error of a hidden layer unit j, the weighted sum of the errors of
the units connected to unit j in the next layer is considered. The error of a
hidden layer unit j is

Err_j = O_j (1 - O_j) Σ_k Err_k w_jk

› where w_jk is the weight of the connection from unit j to a unit k in the next higher
layer, and Err_k is the error of unit k.
› The weights and biases are updated to reflect the propagated errors.
› Changes in weights are represented by Δw_ij and calculated as:

Δw_ij = (l) Err_j O_i
w_ij = w_ij + Δw_ij
› Here l is the learning rate, a constant typically having a value between 0.0 and 1.0.
› A rule of thumb is to set the learning rate to 1/t, where t is the number of
iterations through the training set so far.
› Biases are updated by the following equations, where Δθ_j is the change in bias θ_j:

Δθ_j = (l) Err_j
θ_j = θ_j + Δθ_j
› The weights and biases are updated after the presentation of each tuple. This is
referred to as case updating.
› Alternatively, the weight and bias increments could be accumulated in variables,
so that the weights and biases are updated after all the tuples in the training set
have been presented. This latter strategy is called epoch updating, where one
iteration through the training set is an epoch.
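› A minimal sketch of the backward pass with case updating, following the Err and Δ formulas above (it reuses the structures from the earlier sketches; names are illustrative):

```python
def backpropagate(outputs, target, weights, biases, l_rate):
    """One backward pass with case updating; `outputs` is the per-layer
    list returned by feed_forward."""
    # Output-layer error: Err_j = O_j (1 - O_j)(T_j - O_j)
    errors = [o * (1 - o) * (t - o) for o, t in zip(outputs[-1], target)]
    for layer in range(len(weights) - 1, -1, -1):
        O_prev = outputs[layer]
        if layer > 0:
            # Hidden-layer error: Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
            prev_errors = [O_prev[j] * (1 - O_prev[j]) *
                           sum(errors[k] * weights[layer][k][j]
                               for k in range(len(errors)))
                           for j in range(len(O_prev))]
        for k, err_k in enumerate(errors):
            for j in range(len(O_prev)):
                weights[layer][k][j] += l_rate * err_k * O_prev[j]  # w += Δw
            biases[layer][k] += l_rate * err_k                      # θ += Δθ
        if layer > 0:
            errors = prev_errors
```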
› Terminating condition: Training stops when all Δw_ij in the
previous epoch were so small as to be below some
specified threshold; or
› The percentage of tuples misclassified in the previous
epoch is below some threshold; or
› A prespecified number of epochs has expired.
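› A minimal training-loop sketch tying the pieces together; it implements two of the three terminating conditions (the thresholds and names are illustrative assumptions):

```python
def train(data, weights, biases, max_epochs=1000, err_threshold=0.05):
    """Training loop with case updating. Stops when the fraction of
    misclassified tuples in an epoch falls below a threshold, or after
    a prespecified number of epochs."""
    for epoch in range(1, max_epochs + 1):
        l_rate = 1.0 / epoch             # rule of thumb: l = 1/t
        misclassified = 0
        for x, target in data:
            outputs = feed_forward(x, weights, biases)
            if [round(o) for o in outputs[-1]] != list(target):
                misclassified += 1
            backpropagate(outputs, target, weights, biases, l_rate)
        if misclassified / len(data) < err_threshold:
            break                        # terminating condition met
```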
› ASSIGNMENT: Work out the example given in the book.
Associative classification
› Steps:
1. Mine the data for frequent itemsets, that is, find commonly
occurring attribute–value pairs in the data.
2. Analyze the frequent itemsets to generate association rules per
class, which satisfy confidence and support criteria.
3. Organize the rules to form a rule-based classifier.
› Three types:
– Classification Based on Associations (CBA)
– Classification based on Multiple Association Rules (CMAR)
– Classification based on Predictive Association Rules (CPAR)
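› A minimal CBA-style sketch of steps 2 and 3: keep class-labeled rules meeting minimum support and confidence, then order them into a decision list (the rule format and thresholds are assumptions):

```python
# Sketch: `rules` are class association rules of the form
# (antecedent_itemset, class_label, support, confidence), e.g. produced
# by a frequent-itemset miner in steps 1-2.

def build_classifier(rules, min_sup=0.01, min_conf=0.6):
    kept = [r for r in rules if r[2] >= min_sup and r[3] >= min_conf]
    # Order into a decision list: higher confidence first, ties broken
    # by higher support; the earliest satisfied rule fires.
    kept.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return kept
```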