3 DM Classification
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about the values of data, using known results learned from historical data
– Prediction methods use existing variables to predict unknown or future values of other variables.
• Predict one variable Y given a set of other variables X. Here X could be an n-dimensional vector
– In effect this is function approximation: learning the relationship between Y and X
• There are many, many algorithms for predictive modeling in statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, with less emphasis on understanding the model
Prediction Problems:
Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing numeric values
Models and Patterns
• Model = abstract representation of a given training data set
– e.g., a very simple linear model structure: Y = aX + b
– a and b are parameters determined from the data
– Y = aX + b is the model structure
– Y = 0.9X + 0.3 is a particular model
• Pattern represents "local structure" in a dataset
– E.g., if X > x then Y > y with probability p
• Example: given a finite sample of <x, f(x)> pairs, can we create a model that holds for future values?

x    f(x)
1    1
2    4
3    9
4    16
5    ?

To guess the true function f, find some pattern (called a hypothesis) in the training examples, and assume that the pattern will hold for future examples too (see the sketch below).
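A minimal sketch of this hypothesis-finding idea for the table above; the two candidate hypotheses are illustrative assumptions, not from the slides:

```python
# Check candidate hypotheses against the <x, f(x)> sample and predict f(5).
sample = [(1, 1), (2, 4), (3, 9), (4, 16)]

hypotheses = {
    "linear: f(x) = 5x - 5": lambda x: 5 * x - 5,
    "square: f(x) = x**2":   lambda x: x ** 2,
}

for name, h in hypotheses.items():
    fits = all(h(x) == y for x, y in sample)
    print(f"{name}  fits sample: {fits}  predicts f(5) = {h(5)}")

# The square hypothesis matches every training pair, so we assume it
# holds for future examples too and predict f(5) = 25.
```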
Predictive Modeling: Customer Scoring
• Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc.)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc.
– Many, many applications of this nature
Classification
• Example:
– Fair-Isaac/HNC's fraud detection software, based on neural networks, led to reported fraud decreases of 30 to 50% (http://www.fairisaac.com/fairisaac)
• Issues
– Significant feature engineering/preprocessing is required
– False alarm rate vs. missed detection rate – what is the tradeoff?
DM Task: Descriptive Modeling
• Goal is to build a "descriptive" model that captures the underlying properties of the observations
– e.g., a model that could simulate the data if needed

[Figure: EM clustering (iteration 25) of blood-cell measurements; x-axis: Red Blood Cell Volume]
Pattern (Association Rule) Discovery
• Goal is to discover interesting "local" patterns (e.g., sequential patterns) in the data rather than to characterize the data globally
– Also called link analysis (uncovers relationships among data)
Example of Pattern Discovery
• Example in retail: linking customer transactions to consumer behavior:
– People who bought "Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com)
• Example: football player behavior
– If player A is in the game, player B's scoring rate increases from a 25% chance per game to a 95% chance per game
• What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDC
BBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBB
CCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBC
ADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABAC
BDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABC
CBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCD
CCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCB
DBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDD
BDDCABACBCADCDCBAAADCADDADAABBACCBB
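One simple way to look for local structure in a symbol sequence like the one above is to count repeated substrings. A minimal sketch (only the first and seventh rows of the sequence are pasted here to keep the snippet short; the full string can be substituted):

```python
from collections import Counter

# First and seventh rows of the sequence above, concatenated.
seq = ("ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDC"
       "CCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCB")

k = 8  # substring length to scan for (an arbitrary choice)
counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
for gram, n in counts.most_common(3):
    print(gram, n)   # 8-grams from the repeated run appear twice
```

The run "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADA" occurs in both rows, so its 8-grams show up with count 2 – a "local" pattern hidden in apparently random data.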
Basic Data Mining Algorithms
• Classification (also called supervised learning) maps data into predefined groups or classes to enhance the prediction process
• Clustering (also called unsupervised learning) groups similar data together into clusters
– It is used to find appropriate groupings of elements for a set of data
– Unlike classification, clustering is a kind of undirected knowledge discovery or unsupervised learning; that is, there is no target field, and the relationships among the data are identified by a bottom-up approach
• Association rules (also known as market basket analysis)
– Discover interesting associations between attributes contained in a database
– Based on frequency counts of the items occurring together in events, an association rule tells us: if item X is part of an event, what percentage of the time is item Y also part of the event (a small sketch follows below)
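A minimal sketch of the frequency-count idea behind association rules; the transactions are made-up examples, not from the slides:

```python
# Support and confidence of the rule X -> Y from raw transaction counts.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

x, y = "bread", "milk"
with_x = [t for t in transactions if x in t]
with_xy = [t for t in with_x if y in t]

support = len(with_xy) / len(transactions)   # fraction of events with X and Y
confidence = len(with_xy) / len(with_x)      # of events with X, fraction with Y
print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```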
Classification
Classification: Definition
• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
– Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
• For example, one may use classification to predict whether the weather on a particular day will be "sunny", "rainy" or "cloudy".
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a minimal sketch of the two-step process follows below)
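A minimal sketch of the two-step process, assuming scikit-learn is available; the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Step 1: model construction on the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```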
Illustrating Classification Task

Training set (fed to the learning algorithm to induce a model):
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
6     No        Medium    60K       No

Test set (the model is applied to predict the unknown class labels):
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
15    No        Large     67K       ?
Confusion Matrix for Performance Evaluation

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL     Class=Yes     a (TP)      b (FN)
CLASS      Class=No      c (FP)      d (TN)
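From these four counts the standard performance measures can be computed directly. A small sketch (the count values are made up for illustration):

```python
# Accuracy and related measures from the confusion-matrix counts above.
a, b, c, d = 70, 10, 5, 15   # TP, FN, FP, TN (made-up values)

accuracy  = (a + d) / (a + b + c + d)  # fraction classified correctly
precision = a / (a + c)                # of predicted Yes, fraction correct
recall    = a / (a + b)                # of actual Yes, fraction found
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```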
The k-nearest-neighbour example below uses the squared Euclidean distance:

$$\mathrm{Dist}(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$$
Example
• We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/m2)   Y = Classification
7                                7                       Bad
7                                4                       Bad
3                                4                       Good
1                                4                       Good

• Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.
– Without undertaking another expensive survey, can we guess the goodness of the new tissue? Use the squared Euclidean distance as the similarity measurement.
Solution

X1   X2   Squared distance to query (3, 7)   Rank (min. distance)   Included in 3-NNs?   Y = Category of NN
7    7    (7-3)^2 + (7-7)^2 = 16             3                      Yes                  Bad
7    4    (7-3)^2 + (4-7)^2 = 25             4                      No                   -
3    4    (3-3)^2 + (4-7)^2 = 9              1                      Yes                  Good
1    4    (1-3)^2 + (4-7)^2 = 13             2                      Yes                  Good

• Use a simple majority of the categories of the nearest neighbours as the prediction value of the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the Good category. (A small code sketch follows below.)
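A minimal 3-nearest-neighbour sketch that reproduces the tissue example above:

```python
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Sort training samples by distance to the query and take the k closest.
neighbours = sorted(train, key=lambda s: sq_dist(s[0], query))[:k]
votes = Counter(label for _, label in neighbours)
print(votes.most_common(1)[0][0])   # -> "Good" (2 Good vs 1 Bad)
```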
KNN: Advantages & Disadvantages
• Advantages
– Nonparametric architecture
– Simple
– Powerful
– Requires no training time
• Disadvantages: difficulties with k-nearest-neighbour algorithms
– Memory intensive: all the training examples must be stored
• When a test example is given, the closest matches must be found
– Classification/estimation is slow
– Has to calculate the distance of the test case from all training cases
– There may be irrelevant attributes amongst the attributes – curse of dimensionality
Decision Tree
Decision Trees
• A decision tree constructs a tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
• Given an instance of an object or situation, which is specified by a set of properties, the tree returns a "yes" or "no" decision about that instance.

[Figure: a decision tree whose root tests Attribute_1; the branches value-1 and value-3 lead to further tests on Attribute_2, while value-2 leads directly to the leaf Class1]
• Information Gain
– Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder
• First, compute the disorder using entropy: the expected information needed to classify objects into classes
• Second, measure the information gain: calculate by how much the disorder of a set would be reduced by knowing the value of a particular attribute
Entropy
• The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative. It is given by:

$$\mathrm{Entropy}(S) = D(n^{+}, n^{-}) = -\frac{n^{+}}{n}\log_2\frac{n^{+}}{n} - \frac{n^{-}}{n}\log_2\frac{n^{-}}{n}$$

• Some useful properties of the entropy:
– D(n, m) = D(m, n)        // symmetry
– D(0, m) = D(m, 0) = 0    // pure node case (purity)
D(S) = 0 means that all the examples in S have the same class
– D(m, m) = 1              // maximum uncertainty
D(S) = 1 means that half the examples in S are of one class and half are of the opposite class
Information Gain
• The information gain measures the expected reduction in entropy due to splitting on an attribute A:

$$\mathrm{GAIN}_{split} = \mathrm{Entropy}(S) - \sum_{i=1}^{k}\frac{n_i}{n}\,\mathrm{Entropy}(i)$$

where the parent node S is split into k partitions and n_i is the number of records in partition i.

• Example, for a set with 3 positive and 5 negative examples:

$$D(3^{+}, 5^{-}) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$$
Which decision variable minimises the disorder?

Attribute   Average disorder of the split
Hair        0.50
Height      0.69
Weight      0.94
Lotion      0.61

• Which decision variable maximises the information gain then?
• Remember: it is the one which minimises the average disorder.
Gain(hair)   = 0.954 - 0.50 = 0.454
Gain(height) = 0.954 - 0.69 = 0.264
Gain(weight) = 0.954 - 0.94 = 0.014
Gain(lotion) = 0.954 - 0.61 = 0.344
(A small code sketch of these computations follows below.)
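A minimal sketch of the entropy and information-gain computations above:

```python
from math import log2

def entropy(pos, neg):
    """D(pos, neg): disorder of a set with pos positive / neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken as 0 (pure node)
            p = count / total
            result -= p * log2(p)
    return result

parent = entropy(3, 5)                     # 0.954 for the sunburn data
print(f"Entropy(S) = {parent:.3f}")

# Splitting on hair colour: blonde -> (2+, 2-), red -> (1+, 0-), brown -> (0+, 3-)
# (per-branch counts taken from the sunburn example that follows).
branches = [(2, 2), (1, 0), (0, 3)]
n = sum(p + q for p, q in branches)
avg_disorder = sum((p + q) / n * entropy(p, q) for p, q in branches)
print(f"Gain(hair) = {parent - avg_disorder:.3f}")   # 0.954 - 0.50 = 0.454
```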
The best decision tree?

is_sunburned – first split on Hair colour:
  red:    Emily
  brown:  Alex, Pete, John
  blonde: ?   (Sunburned = Sarah, Annie; None = Dana, Katie)

The blonde branch is still mixed, so split it again on Lotion used:

is_sunburned:
  Hair colour
    red:    Emily
    brown:  Alex, Pete, John
    blonde: Lotion used
      no:  Sarah, Annie
      yes: Dana, Katie
Sunburn sufferers are ...
• You can view the decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn:

if (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification methods)
• Convertible to simple and easy-to-understand classification if-then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of the data distribution; works well on noisy data
Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Bayesian Learning

Why Bayesian Classification?
• Provides practical learning algorithms
– Probabilistic learning: calculate explicit probabilities for a hypothesis, e.g. naïve Bayes
• Prior knowledge and observed data can be combined
– Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct
• It is a generative (model-based) approach, which offers a useful conceptual framework
– Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities; e.g. sequences can also be classified, based on a probabilistic model specification
– Any kind of object can be classified, based on a probabilistic model specification
CONDITIONAL PROBABILITY
• Probability: how likely is it that an event will happen?
• Sample space S
– Events A and C are subsets of S
• Naïve Bayes example (PlayTennis): choose the class that maximises the product of the class prior and the conditional probabilities of the observed attribute values:

$$\arg\max_{C \in \{yes,\,no\}} P(C)\,P(Outl{=}sunny \mid C)\,P(Temp{=}cool \mid C)\,P(Hum{=}high \mid C)\,P(Wind{=}strong \mid C)$$

• Working (a small code sketch follows below):

$$P(yes)\,P(sunny \mid yes)\,P(cool \mid yes)\,P(high \mid yes)\,P(strong \mid yes) = 0.0053$$
$$P(no)\,P(sunny \mid no)\,P(cool \mid no)\,P(high \mid no)\,P(strong \mid no) = 0.0206$$

Answer: PlayTennis = no
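A minimal naïve Bayes scoring sketch for the query above. The slide does not show its frequency table, so the probability estimates below are the standard PlayTennis values (an assumption); they reproduce the 0.0053 and 0.0206 scores:

```python
# Score each class as prior * product of conditional probabilities.
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

query = ["sunny", "cool", "high", "strong"]
scores = {}
for c in priors:
    score = priors[c]
    for value in query:
        score *= cond[c][value]   # naive conditional-independence assumption
    scores[c] = score

print(scores)                                        # {'yes': ~0.0053, 'no': ~0.0206}
print("PlayTennis =", max(scores, key=scores.get))   # -> no
```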
Brain vs. Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
Features of the Brain
• Ten billion (10^10) neurons
• Neuron switching time ~10^-3 seconds
• Face recognition ~0.1 seconds
• On average, each neuron has several thousand connections
• Hundreds of operations per second
• High degree of parallel computation
• Distributed representations
• Neurons die off frequently (and are never replaced)
• Compensates for problems by massive parallelism
Neural Network Classifier
• It is represented as a layered set of interconnected processors. These processor nodes have a relationship with the neurons of the brain. Each node has a weighted connection to several other nodes in adjacent layers. Individual nodes take the input received from connected nodes and use the weights together to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are inputs to the units making up the output layer.
Architecture of a Neural Network
• Neural networks are used to look for patterns in data, learn these patterns, and then classify new patterns and make forecasts
• A network with the input and output layers only is called a single-layered neural network, whereas a multilayer neural network is a generalised one with one or more hidden layers
– A network containing two hidden layers is called a three-layer neural network, and so on

A single-layered network computes its output as a weighted sum of the inputs passed through the sigmoid activation function (a small sketch follows below):

$$o = \sigma\!\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$

[Figure: a single-layered NN with inputs x1, x2, x3 and weights w1, w2, w3 feeding one output unit; a multilayer NN with input nodes, hidden nodes and output nodes]
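A minimal sketch of the single-layer computation above: a weighted sum passed through the sigmoid. The weight and input values are made up for illustration:

```python
from math import exp

def sigmoid(y):
    """Logistic activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + exp(-y))

x = [0.5, 1.0, -0.3]      # inputs x1..x3 (made-up values)
w = [0.4, -0.2, 0.7]      # weights w1..w3 (made-up values)

net = sum(wi * xi for wi, xi in zip(w, x))   # the weighted sum
print("output o =", sigmoid(net))
```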
A Multilayer Neural Network
• INPUT: records with a class attribute and normalised attribute values
– INPUT VECTOR: X = {x1, x2, …, xm}, where m is the number of attributes
– INPUT LAYER – there are as many nodes as attributes, i.e. as the length of the input vector
• HIDDEN LAYER – neither its input nor its output can be observed from outside
– The number of nodes in the hidden layer and the number of hidden layers depend on the implementation
• OUTPUT LAYER – corresponds to the class attribute
– There are as many nodes as classes (values of the class attribute)
– Ok, where k = 1, 2, …, n, and n is the number of classes

[Figure: a multilayer network with an input layer, a hidden layer and an output layer]
Hidden Layer: Neuron with Activation
• The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm
2. An adder that computes the weighted sum of the inputs
3. An activation function (e.g. the sigmoid above) that maps the sum to the neuron's output
Training the Neural Network
• The purpose is to learn to generalise using a set of sample patterns where the desired output is known.
• Back-propagation is the most commonly used method for training multilayer feed-forward NNs.
– Back-propagation learns by iteratively processing a set of training data (samples).
– For each sample, the weights are modified to minimise the error between the desired output and the actual output.
• After propagating an input through the network, the error is calculated, and the error is propagated back through the network while the weights are adjusted in order to make the error smaller. (A minimal sketch follows below.)
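A minimal back-propagation sketch for a small feed-forward network, assuming NumPy is available. The task (XOR), the network sizes and the learning rate are illustrative assumptions, not from the slides; the bias is handled as a constant input of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)   # desired outputs (XOR, assumed)

W1 = rng.normal(size=(3, 3))   # input (+bias) -> 3 hidden units
W2 = rng.normal(size=(3, 1))   # hidden -> output
lr = 0.5

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

for _ in range(20000):
    h = sigmoid(X @ W1)                        # forward: hidden layer
    o = sigmoid(h @ W2)                        # forward: output layer
    delta_o = (t - o) * o * (1 - o)            # backward: output error term
    delta_h = (delta_o @ W2.T) * h * (1 - h)   # backward: hidden error term
    W2 += lr * h.T @ delta_o                   # adjust weights to reduce error
    W1 += lr * X.T @ delta_h

# Outputs should approach the targets 0, 1, 1, 0; some random
# initialisations may need more iterations to converge.
print(o.round(2).ravel())
```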
Training Algorithm
• The applied learning algorithm is as follows:
– Initialise the weights and threshold to small random numbers.
– Present a vector x to the neuron inputs and calculate the output using the adder function:

$$y = \sum_{j=1}^{m} w_j x_j$$

– Update the weights, where η is the learning rate and y_T is the target output:

$$w_j \leftarrow w_j + \eta\,(y_T - y)\,x_j$$
ANN Training Example
Given the following two inputs x1, x2, find an equation that helps to draw the decision boundary.

Bias   1st input (x1)   2nd input (x2)   Target output
-1     0                0                0
-1     1                0                0
-1     0                1                1
-1     1                1                1

• Let's say we have the following initialisations: W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1
• Training – epochs 1 and 2 proceed in the same way (each misclassifies only the second pattern), leaving W1 = 0.72, W2 = 0.62, W0 = 0.42 at the start of epoch 3
• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 - 0.42 = -0.42 → y = 0 (correct)
y2 = 0.72*1 + 0.62*0 - 0.42 = 0.30 → y = 1 (incorrect, so update the weights)
W1(3) = 0.72 + 0.1 * (0 - 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 - 1) * (-1) = 0.52
y3 = 0.62*0 + 0.62*1 - 0.52 = 0.10 → y = 1 (correct)
y4 = 0.62*1 + 0.62*1 - 0.52 = 0.72 → y = 1 (correct)
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 - 0.52 = -0.52 → y = 0 (correct)
y2 = 0.62*1 + 0.62*0 - 0.52 = 0.10 → y = 1 (incorrect, so update the weights)
W1(4) = 0.62 + 0.1 * (0 - 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 - 1) * (-1) = 0.62
y3 = 0.52*0 + 0.62*1 - 0.62 = 0 → y = 0 (incorrect, so update the weights)
W1(4) = 0.52 + 0.1 * (1 - 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 - 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 - 0) * (-1) = 0.52
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1 (correct)
• Finally, all four patterns are classified correctly:
y1 = 0.52*0 + 0.72*0 - 0.52 = -0.52 → y = 0 (correct)
y2 = 0.52*1 + 0.72*0 - 0.52 = 0.0  → y = 0 (correct)
y3 = 0.52*0 + 0.72*1 - 0.52 = 0.20 → y = 1 (correct)
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1 (correct)
ANN Training Example

[Figure: the four training points in the (x1, x2) plane, with the learned line separating the positive examples (x2 = 1, marked +) from the negative examples (x2 = 0, marked o)]
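A minimal perceptron-training sketch that reproduces the worked example above (same initial weights, learning rate, update rule and threshold activation):

```python
# Perceptron training loop for the example; the bias is a fixed input
# of -1 with weight W0, exactly as in the slides.
patterns = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
w1, w2, w0 = 0.92, 0.62, 0.22   # initial weights W1(0), W2(0), W0(0)
eta = 0.1                       # learning rate

for epoch in range(10):
    errors = 0
    for (x1, x2), target in patterns:
        net = w1 * x1 + w2 * x2 - w0   # adder output (bias input = -1)
        y = 1 if net > 0 else 0        # threshold activation
        if y != target:                # apply the learning rule on error
            w1 += eta * (target - y) * x1
            w2 += eta * (target - y) * x2
            w0 += eta * (target - y) * (-1)
            errors += 1
    if errors == 0:                    # converged: all patterns correct
        break

print(f"W1={w1:.2f} W2={w2:.2f} W0={w0:.2f}")   # -> W1=0.52 W2=0.72 W0=0.52
```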
Pros and Cons of Neural Networks
• Useful for learning complex data like handwriting, speech and image recognition

Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons
- Slow training time
- Hard to interpret and understand the learned function (weights)
- Hard to implement: trial and error for choosing the number of nodes

• A neural network needs a long time for training.
• A neural network has a high tolerance to noisy and incomplete data.
• Conclusion: use neural nets only if decision trees fail.