3 DM Classification
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about the values of data, using known results learned from historical data
– Prediction methods use existing variables to predict unknown or future values of other variables.
• Predict one variable Y given a set of other variables X. Here X could be an n-dimensional vector
– In effect this is function approximation: learning the relationship between Y and X
• There are many, many algorithms for predictive modeling in statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, with less emphasis on understanding the model
Prediction Problems:
Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing numeric values
Models and Patterns
• Model = abstract representation of a given training data set
– e.g., a very simple linear model structure: Y = aX + b
– a and b are parameters determined from the data
– Y = aX + b is the model structure
– Y = 0.9X + 0.3 is a particular model
• Pattern represents "local structure" in a dataset
– E.g., if X > x then Y > y with probability p
• Example: given a finite sample of <x, f(x)> pairs, can we create a model that holds for future values?

x    f(x)
1    1
2    4
3    9
4    16
5    ?

To guess the true function f, find some pattern (called a hypothesis) in the training examples, and assume that the pattern will hold for future examples too (see the sketch below).
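A minimal sketch of this hypothesis-finding idea for the table above; the two candidate hypotheses are illustrative assumptions, not from the slides:

```python
# Check candidate hypotheses against the <x, f(x)> sample and predict f(5).
sample = [(1, 1), (2, 4), (3, 9), (4, 16)]

hypotheses = {
    "linear: f(x) = 5x - 5": lambda x: 5 * x - 5,
    "square: f(x) = x**2":   lambda x: x ** 2,
}

for name, h in hypotheses.items():
    fits = all(h(x) == y for x, y in sample)
    print(f"{name}  fits sample: {fits}  predicts f(5) = {h(5)}")

# The square hypothesis matches every training pair, so we assume it
# holds for future examples too and predict f(5) = 25.
```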
Predictive Modeling: Customer Scoring
• Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc.)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc.
– Many, many applications of this nature
Classification
• Example:
– Fair-Isaac/HNC's fraud detection software, based on neural networks, led to reported fraud decreases of 30 to 50% (http://www.fairisaac.com/fairisaac)
• Issues
– Significant feature engineering/preprocessing is required
– False alarm rate vs. missed detection rate – what is the tradeoff?
DM Task: Descriptive Modeling
• Goal is to build a "descriptive" model that captures the underlying properties of the observations
– e.g., a model that could simulate the data if needed

[Figure: EM clustering (iteration 25) of blood-cell measurements; x-axis: Red Blood Cell Volume]
Pattern (Association Rule) Discovery
• Goal is to discover interesting "local" patterns (e.g., sequential patterns) in the data rather than to characterize the data globally
– Also called link analysis (uncovers relationships among data)
Example of Pattern Discovery
• Example in retail: linking customer transactions to consumer behavior:
– People who bought "Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com)
• Example: football player behavior
– If player A is in the game, player B's scoring rate increases from a 25% chance per game to a 95% chance per game
• What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDC
BBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBB
CCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBC
ADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABAC
BDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABC
CBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCD
CCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCB
DBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDD
BDDCABACBCADCDCBAAADCADDADAABBACCBB
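One simple way to look for local structure in a symbol sequence like the one above is to count repeated substrings. A minimal sketch (only the first and seventh rows of the sequence are pasted here to keep the snippet short; the full string can be substituted):

```python
from collections import Counter

# First and seventh rows of the sequence above, concatenated.
seq = ("ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDC"
       "CCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCB")

k = 8  # substring length to scan for (an arbitrary choice)
counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
for gram, n in counts.most_common(3):
    print(gram, n)   # 8-grams from the repeated run appear twice
```

The run "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADA" occurs in both rows, so its 8-grams show up with count 2 – a "local" pattern hidden in apparently random data.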
Basic Data Mining Algorithms
• Classification (also called supervised learning) maps data into predefined groups or classes to enhance the prediction process
• Clustering (also called unsupervised learning) groups similar data together into clusters
– It is used to find appropriate groupings of elements for a set of data
– Unlike classification, clustering is a kind of undirected knowledge discovery or unsupervised learning; that is, there is no target field, and the relationships among the data are identified by a bottom-up approach
• Association rules (also known as market basket analysis)
– Discover interesting associations between attributes contained in a database
– Based on frequency counts of the items occurring together in events, an association rule tells us: if item X is part of an event, what percentage of the time is item Y also part of the event (a small sketch follows below)
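A minimal sketch of the frequency-count idea behind association rules; the transactions are made-up examples, not from the slides:

```python
# Support and confidence of the rule X -> Y from raw transaction counts.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

x, y = "bread", "milk"
with_x = [t for t in transactions if x in t]
with_xy = [t for t in with_x if y in t]

support = len(with_xy) / len(transactions)   # fraction of events with X and Y
confidence = len(with_xy) / len(with_x)      # of events with X, fraction with Y
print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```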
Classification
Classification: Definition
• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
– Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
• For example, one may use classification to predict whether the weather on a particular day will be "sunny", "rainy" or "cloudy".
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a minimal sketch of the two-step process follows below)
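A minimal sketch of the two-step process, assuming scikit-learn is available; the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Step 1: model construction on the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```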
Illustrating Classification Task

Training set (fed to the learning algorithm to induce a model):
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
6     No        Medium    60K       No

Test set (the model is applied to predict the unknown class labels):
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
15    No        Large     67K       ?
Confusion Matrix for Performance Evaluation

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL     Class=Yes     a (TP)      b (FN)
CLASS      Class=No      c (FP)      d (TN)
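From these four counts the standard performance measures can be computed directly. A small sketch (the count values are made up for illustration):

```python
# Accuracy and related measures from the confusion-matrix counts above.
a, b, c, d = 70, 10, 5, 15   # TP, FN, FP, TN (made-up values)

accuracy  = (a + d) / (a + b + c + d)  # fraction classified correctly
precision = a / (a + c)                # of predicted Yes, fraction correct
recall    = a / (a + b)                # of actual Yes, fraction found
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```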
The k-nearest-neighbour example below uses the squared Euclidean distance:

$$\mathrm{Dist}(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$$
Example
• We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/m2)   Y = Classification
7                                7                       Bad
7                                4                       Bad
3                                4                       Good
1                                4                       Good

• Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.
– Without undertaking another expensive survey, can we guess the goodness of the new tissue? Use the squared Euclidean distance as the similarity measurement.
Solution

X1   X2   Squared distance to query (3, 7)   Rank (min. distance)   Included in 3-NNs?   Y = Category of NN
7    7    (7-3)^2 + (7-7)^2 = 16             3                      Yes                  Bad
7    4    (7-3)^2 + (4-7)^2 = 25             4                      No                   -
3    4    (3-3)^2 + (4-7)^2 = 9              1                      Yes                  Good
1    4    (1-3)^2 + (4-7)^2 = 13             2                      Yes                  Good

• Use a simple majority of the categories of the nearest neighbours as the prediction value of the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the Good category. (A small code sketch follows below.)
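A minimal 3-nearest-neighbour sketch that reproduces the tissue example above:

```python
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Sort training samples by distance to the query and take the k closest.
neighbours = sorted(train, key=lambda s: sq_dist(s[0], query))[:k]
votes = Counter(label for _, label in neighbours)
print(votes.most_common(1)[0][0])   # -> "Good" (2 Good vs 1 Bad)
```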
KNN: Advantages & Disadvantages
• Advantages
– Nonparametric architecture
– Simple
– Powerful
– Requires no training time
• Disadvantages: difficulties with k-nearest-neighbour algorithms
– Memory intensive: all the training examples must be stored
• When a test example is given, the closest matches must be found
– Classification/estimation is slow
– Has to calculate the distance of the test case from all training cases
– There may be irrelevant attributes amongst the attributes – curse of dimensionality
Decision Tree
Decision Trees
• A decision tree constructs a tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
• Given an instance of an object or situation, which is specified by a set of properties, the tree returns a "yes" or "no" decision about that instance.

[Figure: a decision tree whose root tests Attribute_1; the branches value-1 and value-3 lead to further tests on Attribute_2, while value-2 leads directly to the leaf Class1]
• Information Gain
– Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder
• First, compute the disorder using entropy: the expected information needed to classify objects into classes
• Second, measure the information gain: calculate by how much the disorder of a set would be reduced by knowing the value of a particular attribute
Entropy
• The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative. It is given by:

$$\mathrm{Entropy}(S) = D(n^{+}, n^{-}) = -\frac{n^{+}}{n}\log_2\frac{n^{+}}{n} - \frac{n^{-}}{n}\log_2\frac{n^{-}}{n}$$

• Some useful properties of the entropy:
– D(n, m) = D(m, n)        // symmetry
– D(0, m) = D(m, 0) = 0    // pure node case (purity)
D(S) = 0 means that all the examples in S have the same class
– D(m, m) = 1              // maximum uncertainty
D(S) = 1 means that half the examples in S are of one class and half are of the opposite class
Information Gain
• The information gain measures the expected reduction in entropy due to splitting on an attribute A:

$$\mathrm{GAIN}_{split} = \mathrm{Entropy}(S) - \sum_{i=1}^{k}\frac{n_i}{n}\,\mathrm{Entropy}(i)$$

where the parent node S is split into k partitions and n_i is the number of records in partition i.

• Example, for a set with 3 positive and 5 negative examples:

$$D(3^{+}, 5^{-}) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$$
Which decision variable minimises the disorder?

Attribute   Average disorder of the split
Hair        0.50
Height      0.69
Weight      0.94
Lotion      0.61

• Which decision variable maximises the information gain then?
• Remember: it is the one which minimises the average disorder.
Gain(hair)   = 0.954 - 0.50 = 0.454
Gain(height) = 0.954 - 0.69 = 0.264
Gain(weight) = 0.954 - 0.94 = 0.014
Gain(lotion) = 0.954 - 0.61 = 0.344
(A small code sketch of these computations follows below.)
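A minimal sketch of the entropy and information-gain computations above:

```python
from math import log2

def entropy(pos, neg):
    """D(pos, neg): disorder of a set with pos positive / neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken as 0 (pure node)
            p = count / total
            result -= p * log2(p)
    return result

parent = entropy(3, 5)                     # 0.954 for the sunburn data
print(f"Entropy(S) = {parent:.3f}")

# Splitting on hair colour: blonde -> (2+, 2-), red -> (1+, 0-), brown -> (0+, 3-)
# (per-branch counts taken from the sunburn example that follows).
branches = [(2, 2), (1, 0), (0, 3)]
n = sum(p + q for p, q in branches)
avg_disorder = sum((p + q) / n * entropy(p, q) for p, q in branches)
print(f"Gain(hair) = {parent - avg_disorder:.3f}")   # 0.954 - 0.50 = 0.454
```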
The best decision tree?

is_sunburned – first split on Hair colour:
  red:    Emily
  brown:  Alex, Pete, John
  blonde: ?   (Sunburned = Sarah, Annie; None = Dana, Katie)

The blonde branch is still mixed, so split it again on Lotion used:

is_sunburned:
  Hair colour
    red:    Emily
    brown:  Alex, Pete, John
    blonde: Lotion used
      no:  Sarah, Annie
      yes: Dana, Katie
Sunburn sufferers are ...
• You can view the decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn:

if (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification methods)
• Convertible to simple and easy-to-understand classification if-then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of the data distribution; works well on noisy data
Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Bayesian Learning

Why Bayesian Classification?
• Provides practical learning algorithms
– Probabilistic learning: calculate explicit probabilities for a hypothesis, e.g. naïve Bayes
• Prior knowledge and observed data can be combined
– Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct
• It is a generative (model-based) approach, which offers a useful conceptual framework
– Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities; e.g. sequences can also be classified, based on a probabilistic model specification
– Any kind of object can be classified, based on a probabilistic model specification
CONDITIONAL PROBABILITY
• Probability: how likely is it that an event will happen?
• Sample space S
– Events A and C are subsets of S
• Naïve Bayes example (PlayTennis): choose the class that maximises the product of the class prior and the conditional probabilities of the observed attribute values:

$$\arg\max_{C \in \{yes,\,no\}} P(C)\,P(Outl{=}sunny \mid C)\,P(Temp{=}cool \mid C)\,P(Hum{=}high \mid C)\,P(Wind{=}strong \mid C)$$

• Working (a small code sketch follows below):

$$P(yes)\,P(sunny \mid yes)\,P(cool \mid yes)\,P(high \mid yes)\,P(strong \mid yes) = 0.0053$$
$$P(no)\,P(sunny \mid no)\,P(cool \mid no)\,P(high \mid no)\,P(strong \mid no) = 0.0206$$

Answer: PlayTennis = no
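A minimal naïve Bayes scoring sketch for the query above. The slide does not show its frequency table, so the probability estimates below are the standard PlayTennis values (an assumption); they reproduce the 0.0053 and 0.0206 scores:

```python
# Score each class as prior * product of conditional probabilities.
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

query = ["sunny", "cool", "high", "strong"]
scores = {}
for c in priors:
    score = priors[c]
    for value in query:
        score *= cond[c][value]   # naive conditional-independence assumption
    scores[c] = score

print(scores)                                        # {'yes': ~0.0053, 'no': ~0.0206}
print("PlayTennis =", max(scores, key=scores.get))   # -> no
```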
Brain vs. Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
Features of the Brain
• Ten billion (10^10) neurons
• Neuron switching time ~10^-3 seconds
• Face recognition ~0.1 seconds
• On average, each neuron has several thousand connections
• Hundreds of operations per second
• High degree of parallel computation
• Distributed representations
• Neurons die off frequently (and are never replaced)
• Compensates for problems by massive parallelism
Neural Network Classifier
• It is represented as a layered set of interconnected processors. These processor nodes have a relationship with the neurons of the brain. Each node has a weighted connection to several other nodes in adjacent layers. Individual nodes take the input received from connected nodes and use the weights together to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are inputs to the units making up the output layer.
Architecture of a Neural Network
• Neural networks are used to look for patterns in data, learn these patterns, and then classify new patterns and make forecasts
• A network with the input and output layers only is called a single-layered neural network, whereas a multilayer neural network is a generalised one with one or more hidden layers
– A network containing two hidden layers is called a three-layer neural network, and so on

A single-layered network computes its output as a weighted sum of the inputs passed through the sigmoid activation function (a small sketch follows below):

$$o = \sigma\!\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$

[Figure: a single-layered NN with inputs x1, x2, x3 and weights w1, w2, w3 feeding one output unit; a multilayer NN with input nodes, hidden nodes and output nodes]
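A minimal sketch of the single-layer computation above: a weighted sum passed through the sigmoid. The weight and input values are made up for illustration:

```python
from math import exp

def sigmoid(y):
    """Logistic activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + exp(-y))

x = [0.5, 1.0, -0.3]      # inputs x1..x3 (made-up values)
w = [0.4, -0.2, 0.7]      # weights w1..w3 (made-up values)

net = sum(wi * xi for wi, xi in zip(w, x))   # the weighted sum
print("output o =", sigmoid(net))
```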
A Multilayer Neural Network
• INPUT: records with a class attribute and normalised attribute values
– INPUT VECTOR: X = {x1, x2, …, xm}, where m is the number of attributes
– INPUT LAYER – there are as many nodes as attributes, i.e. as the length of the input vector
• HIDDEN LAYER – neither its input nor its output can be observed from outside
– The number of nodes in the hidden layer and the number of hidden layers depend on the implementation
• OUTPUT LAYER – corresponds to the class attribute
– There are as many nodes as classes (values of the class attribute)
– Ok, where k = 1, 2, …, n, and n is the number of classes

[Figure: a multilayer network with an input layer, a hidden layer and an output layer]
Hidden Layer: Neuron with Activation
• The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm
2. An adder that computes the weighted sum of the inputs
3. An activation function (e.g. the sigmoid above) that maps the sum to the neuron's output
Training the Neural Network
• The purpose is to learn to generalise using a set of sample patterns where the desired output is known.
• Back-propagation is the most commonly used method for training multilayer feed-forward NNs.
– Back-propagation learns by iteratively processing a set of training data (samples).
– For each sample, the weights are modified to minimise the error between the desired output and the actual output.
• After propagating an input through the network, the error is calculated, and the error is propagated back through the network while the weights are adjusted in order to make the error smaller. (A minimal sketch follows below.)
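A minimal back-propagation sketch for a small feed-forward network, assuming NumPy is available. The task (XOR), the network sizes and the learning rate are illustrative assumptions, not from the slides; the bias is handled as a constant input of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)   # desired outputs (XOR, assumed)

W1 = rng.normal(size=(3, 3))   # input (+bias) -> 3 hidden units
W2 = rng.normal(size=(3, 1))   # hidden -> output
lr = 0.5

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

for _ in range(20000):
    h = sigmoid(X @ W1)                        # forward: hidden layer
    o = sigmoid(h @ W2)                        # forward: output layer
    delta_o = (t - o) * o * (1 - o)            # backward: output error term
    delta_h = (delta_o @ W2.T) * h * (1 - h)   # backward: hidden error term
    W2 += lr * h.T @ delta_o                   # adjust weights to reduce error
    W1 += lr * X.T @ delta_h

# Outputs should approach the targets 0, 1, 1, 0; some random
# initialisations may need more iterations to converge.
print(o.round(2).ravel())
```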
Training Algorithm
• The applied learning algorithm is as follows:
– Initialise the weights and threshold to small random numbers.
– Present a vector x to the neuron inputs and calculate the output using the adder function:

$$y = \sum_{j=1}^{m} w_j x_j$$

– Update the weights, where η is the learning rate and y_T is the target output:

$$w_j \leftarrow w_j + \eta\,(y_T - y)\,x_j$$
ANN Training Example
Given the following two inputs x1, x2, find an equation that helps to draw the decision boundary.

Bias   1st input (x1)   2nd input (x2)   Target output
-1     0                0                0
-1     1                0                0
-1     0                1                1
-1     1                1                1

• Let's say we have the following initialisations: W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1
• Training – epochs 1 and 2 proceed in the same way (each misclassifies only the second pattern), leaving W1 = 0.72, W2 = 0.62, W0 = 0.42 at the start of epoch 3
• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 - 0.42 = -0.42 → y = 0 (correct)
y2 = 0.72*1 + 0.62*0 - 0.42 = 0.30 → y = 1 (incorrect, so update the weights)
W1(3) = 0.72 + 0.1 * (0 - 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 - 1) * (-1) = 0.52
y3 = 0.62*0 + 0.62*1 - 0.52 = 0.10 → y = 1 (correct)
y4 = 0.62*1 + 0.62*1 - 0.52 = 0.72 → y = 1 (correct)
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 - 0.52 = -0.52 → y = 0 (correct)
y2 = 0.62*1 + 0.62*0 - 0.52 = 0.10 → y = 1 (incorrect, so update the weights)
W1(4) = 0.62 + 0.1 * (0 - 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 - 1) * (-1) = 0.62
y3 = 0.52*0 + 0.62*1 - 0.62 = 0 → y = 0 (incorrect, so update the weights)
W1(4) = 0.52 + 0.1 * (1 - 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 - 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 - 0) * (-1) = 0.52
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1 (correct)
• Finally, all four patterns are classified correctly:
y1 = 0.52*0 + 0.72*0 - 0.52 = -0.52 → y = 0 (correct)
y2 = 0.52*1 + 0.72*0 - 0.52 = 0.0  → y = 0 (correct)
y3 = 0.52*0 + 0.72*1 - 0.52 = 0.20 → y = 1 (correct)
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1 (correct)
ANN Training Example

[Figure: the four training points in the (x1, x2) plane, with the learned line separating the positive examples (x2 = 1, marked +) from the negative examples (x2 = 0, marked o)]
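A minimal perceptron-training sketch that reproduces the worked example above (same initial weights, learning rate, update rule and threshold activation):

```python
# Perceptron training loop for the example; the bias is a fixed input
# of -1 with weight W0, exactly as in the slides.
patterns = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
w1, w2, w0 = 0.92, 0.62, 0.22   # initial weights W1(0), W2(0), W0(0)
eta = 0.1                       # learning rate

for epoch in range(10):
    errors = 0
    for (x1, x2), target in patterns:
        net = w1 * x1 + w2 * x2 - w0   # adder output (bias input = -1)
        y = 1 if net > 0 else 0        # threshold activation
        if y != target:                # apply the learning rule on error
            w1 += eta * (target - y) * x1
            w2 += eta * (target - y) * x2
            w0 += eta * (target - y) * (-1)
            errors += 1
    if errors == 0:                    # converged: all patterns correct
        break

print(f"W1={w1:.2f} W2={w2:.2f} W0={w0:.2f}")   # -> W1=0.52 W2=0.72 W0=0.52
```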
Pros and Cons of Neural Networks
• Useful for learning complex data like handwriting, speech and image recognition

Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons
- Slow training time
- Hard to interpret and understand the learned function (weights)
- Hard to implement: trial and error for choosing the number of nodes

• A neural network needs a long time for training.
• A neural network has a high tolerance to noisy and incomplete data.
• Conclusion: use neural nets only if decision trees fail.