
CS217: Artificial Intelligence and

Machine Learning
(associated lab: CS240)

Pushpak Bhattacharyya
CSE Dept.,
IIT Bombay
Week 4 of 27 Jan 2025: Perceptron capacity, FFNN
Main points covered: Week 3 of 20 Jan 2025
Monotonicity
 A heuristic h(p) is said to satisfy the monotone restriction if, for all p,
h(p) ≤ h(pc) + cost(p, pc), where pc is a child of p.
Theorem
 If monotone restriction (also called triangular
inequality) is satisfied, then for nodes in the
closed list, redirection of parent pointer is not
necessary. In other words, if any node ‘n’ is
chosen for expansion from the open list, then
g(n)=g*(n), where g(n) is the cost of the
path from the start node ‘s’ to ‘n’ at that point
of the search when ‘n’ is chosen, and g*(n) is
the cost of the optimal path from ‘s’ to ‘n’
Relationship between
Monotonicity and Admissibility
 Observation:
Monotone Restriction → Admissibility, but not vice versa
 Statement: If h(ni) ≤ h(nj) + c(ni, nj) for every node ni and its child nj,
then h(ni) ≤ h*(ni) for all i
The Perceptron Model
 A perceptron is a computing element with input lines having associated weights and the cell having a threshold value. The perceptron model is motivated by the biological neuron.

Figure: a perceptron with inputs x1, …, xn, weights w1, …, wn, threshold θ, and output y.
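As a concrete illustration (not on the slide), here is a minimal sketch of the perceptron's computation; the weights, threshold, and inputs are made-up values that realise AND.

```python
# Minimal sketch of a perceptron's output computation (illustrative values).
def perceptron_output(x, w, theta):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > theta else 0

# Example: a 2-input perceptron computing AND (w1 = w2 = 1, theta = 1.5).
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(x, w=(1, 1), theta=1.5))
```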
Perceptron Training Algorithm
1. Start with a random value of w, e.g. <0, 0, 0, …>
2. Test whether w·xi > 0 for the training vectors;
   if the test succeeds for i = 1, 2, …, n, then return w
3. Otherwise, for a failing vector, modify w: wnext = wprev + xfail, and go back to step 2
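A minimal Python sketch of the PTA as stated above. Following the usual convention (assumed here, not spelled out on the slide), the threshold is folded into each vector as an extra -1 component and vectors of the 0-class are negated, so the single test w·x > 0 covers all cases.

```python
import numpy as np

def pta(vectors, w0=None, max_iters=10_000):
    """Perceptron Training Algorithm sketch.

    `vectors` are augmented training vectors (threshold folded in as a -1
    component, 0-class vectors already negated), so the test is simply w.x > 0.
    """
    X = np.asarray(vectors, dtype=float)
    w = np.zeros(X.shape[1]) if w0 is None else np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        failed = [x for x in X if np.dot(w, x) <= 0]   # step 2: test w.x > 0
        if not failed:
            return w                                   # every vector passes
        w = w + failed[0]                              # step 3: w_next = w_prev + x_fail
    raise RuntimeError("did not converge (function may not be linearly separable)")

# Example: AND, inputs augmented as (x1, x2, -1); 0-class vectors negated.
data = [(0, 0, 1), (0, -1, 1), (-1, 0, 1), (1, 1, -1)]
print(pta(data))     # a separating weight vector (w1, w2, theta)
```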
Convergence of PTA
 Statement:
Whatever be the initial choice of weights and
whatever be the vector chosen for testing, PTA
converges if the vectors are from a linearly
separable function.
End of main points
Perceptrons and their computing
power
Fundamental Observation
 The number of threshold functions (TFs) computable by a perceptron is equal to the number of regions produced by the 2^n hyperplanes obtained by plugging the values <x1, x2, x3, …, xn> into the equation
Σ_{i=1}^{n} w_i x_i = θ
The geometrical observation
 Problem: given m linear surfaces called hyperplanes (each hyperplane is of dimension d-1) in d-dimensional space, what is the maximum number of regions produced by their intersection?
i.e., R_{m,d} = ?
Regions produced by lines
Figure: lines L1-L4 in the X1-X2 plane, not necessarily passing through the origin.
L1: 2
L2: 2 + 2 = 4
L3: 2 + 2 + 3 = 7
L4: 2 + 2 + 3 + 4 = 11

New regions created = number of pieces into which the incoming line is cut by the original lines (i.e., number of intersections + 1)
Total number of regions = original number of regions + new regions created
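A quick sanity check of this counting rule, assuming the lines are in general position (no two parallel, no three concurrent):

```python
def regions_after_lines(m):
    """Max number of plane regions created by m lines in general position."""
    regions = 1                      # the empty plane is one region
    for k in range(1, m + 1):
        # the k-th incoming line meets the k-1 existing lines in k-1 points,
        # so it is cut into k pieces, and each piece creates one new region
        regions += k
    return regions

print([regions_after_lines(m) for m in range(1, 5)])   # [2, 4, 7, 11]
```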
Number of computable functions by a neuron

The neuron y has inputs x1, x2 with weights w1, w2 and threshold θ; it fires when w1·x1 + w2·x2 ≥ θ. Plugging the four Boolean inputs into the boundary equation gives four planes:
(0,0):  0 = θ        : P1
(0,1):  w2 = θ       : P2
(1,0):  w1 = θ       : P3
(1,1):  w1 + w2 = θ  : P4
P1, P2, P3 and P4 are planes in the <w1, w2, θ> space.
Number of computable functions by a neuron (cont…)

 P1 produces 2 regions
 P2 is intersected by P1 in a line; 2 more new regions are produced.
Number of regions = 2 + 2 = 4
 P3 is intersected by P1 and P2 in 2 intersecting lines; 4 more regions are produced.
Number of regions = 4 + 4 = 8
 P4 is intersected by P1, P2 and P3 in 3 intersecting lines; 6 more regions are produced.
Number of regions = 8 + 6 = 14
 Thus, a single neuron can compute 14 Boolean functions, which are the linearly separable ones.
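An illustrative brute-force check of this count: sweep a small grid of weights and thresholds (my own choice of grid) and collect the distinct truth tables a single threshold unit can realise.

```python
from itertools import product

# How many 2-input Boolean functions can y = [w1*x1 + w2*x2 > theta] compute?
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = [i / 2 for i in range(-4, 5)]          # w1, w2, theta in {-2.0, ..., 2.0}

functions = set()
for w1, w2, theta in product(grid, repeat=3):
    truth_table = tuple(int(w1 * x1 + w2 * x2 > theta) for x1, x2 in inputs)
    functions.add(truth_table)

print(len(functions))                          # 14 (all but XOR and XNOR)
```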
Points in the same region
If <W1, W2, θ> and <W1', W2', θ'> share a region of the arrangement, then for every input <X1, X2> the conditions
W1·X1 + W2·X2 > θ and
W1'·X1 + W2'·X2 > θ'
are either both true or both false, i.e., the two parameter settings compute the same function.
No. of Regions produced by
Hyperplanes
The number of regions produced by n hyperplanes in d dimensions, all passing through the origin, is given by the following recurrence relation:
R_{n,d} = R_{n-1,d} + R_{n-1,d-1}
We use a generating function as an operating function.

Boundary conditions:
R_{1,d} = 2   (1 hyperplane in d-dim)
R_{n,1} = 2   (n hyperplanes in 1-dim reduce to points through the origin)
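A memoised sketch of this recurrence; it reproduces, e.g., R(4, 3) = 14, matching the four-plane count in <w1, w2, θ> space above.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def R(n, d):
    """Regions produced by n hyperplanes through the origin in d dimensions."""
    if n == 1:           # boundary: 1 hyperplane splits d-dim space into 2
        return 2
    if d == 1:           # boundary: n hyperplanes in 1-dim collapse to the origin
        return 2
    return R(n - 1, d) + R(n - 1, d - 1)

print(R(4, 3))           # 14, as in the <w1, w2, theta> construction above
```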

The generating function is
f(x, y) = Σ_{n≥1} Σ_{d≥1} R_{n,d} x^n y^d

From the recurrence relation we have
R_{n,d} - R_{n-1,d} - R_{n-1,d-1} = 0
R_{n-1,d} corresponds to 'shifting' n by 1 place => multiplication by x
R_{n-1,d-1} corresponds to 'shifting' n and d by 1 place => multiplication by xy

On expanding f(x, y) we get
f(x, y) = R_{1,1}·x·y + R_{1,2}·x·y^2 + R_{1,3}·x·y^3 + … + R_{1,d}·x·y^d + …
        + R_{2,1}·x^2·y + R_{2,2}·x^2·y^2 + … + R_{2,d}·x^2·y^d + …
        + …
        + R_{n,1}·x^n·y + R_{n,2}·x^n·y^2 + … + R_{n,d}·x^n·y^d + …

Multiplying by x and by xy:
x·f(x, y)  = Σ_{n≥1} Σ_{d≥1} R_{n,d} x^{n+1} y^d     = Σ_{n≥2} Σ_{d≥1} R_{n-1,d} x^n y^d
xy·f(x, y) = Σ_{n≥1} Σ_{d≥1} R_{n,d} x^{n+1} y^{d+1} = Σ_{n≥2} Σ_{d≥2} R_{n-1,d-1} x^n y^d

Splitting off the d = 1 terms of x·f(x, y):
x·f(x, y) = Σ_{n≥2} Σ_{d≥2} R_{n-1,d} x^n y^d + Σ_{n≥2} R_{n-1,1} x^n y
          = Σ_{n≥2} Σ_{d≥2} R_{n-1,d} x^n y^d + 2·Σ_{n≥2} x^n y

Splitting f(x, y) itself the same way:
f(x, y) = Σ_{n≥1} Σ_{d≥1} R_{n,d} x^n y^d
        = Σ_{n≥2} Σ_{d≥2} R_{n,d} x^n y^d + Σ_{d≥2} R_{1,d} x y^d + Σ_{n≥2} R_{n,1} x^n y + R_{1,1} x y
        = Σ_{n≥2} Σ_{d≥2} R_{n,d} x^n y^d + 2x·Σ_{d≥2} y^d + 2y·Σ_{n≥2} x^n + 2xy

After all this expansion,
f(x, y) - x·f(x, y) - xy·f(x, y)
  = Σ_{n≥2} Σ_{d≥2} (R_{n,d} - R_{n-1,d} - R_{n-1,d-1}) x^n y^d
    + 2x·Σ_{d≥2} y^d + 2y·Σ_{n≥2} x^n + 2xy - 2y·Σ_{n≥2} x^n
  = 2x·Σ_{d≥1} y^d
since the recurrence term is zero and the two Σ_{n≥2} x^n terms cancel.

This implies
[1 - x - xy]·f(x, y) = 2x·Σ_{d≥1} y^d

f(x, y) = 1/[1 - x(1 + y)] · 2x·Σ_{d≥1} y^d
        = 2x·[y + y^2 + y^3 + … + y^d + …] · [1 + x(1 + y) + x^2(1 + y)^2 + … + x^d(1 + y)^d + …]

Also we have
f(x, y) = Σ_{n≥1} Σ_{d≥1} R_{n,d} x^n y^d

Comparing the coefficients of each term x^n y^d on the two sides we get

R_{n,d} = 2·Σ_{i=0}^{d-1} C(n-1, i)
Implication
 For a perceptron with m weights and 1 threshold, R(n, d) becomes R(2^m, m+1):
R(2^m, m+1) = 2·Σ_{i=0}^{(m+1)-1} C(2^m - 1, i)
            = 2·Σ_{i=0}^{m} C(2^m - 1, i)
            = O(2^{m²})
 The total number of Boolean functions of m variables is 2^(2^m). This shows why #TF << #BF.
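A small comparison (my own sketch) of the region count R(2^m, m+1) from the formula above with the total number of Boolean functions 2^(2^m).

```python
from math import comb

def R(n, d):
    return 2 * sum(comb(n - 1, i) for i in range(d))

for m in range(1, 6):
    regions = R(2 ** m, m + 1)       # region count from the formula above
    total_bf = 2 ** (2 ** m)         # all Boolean functions of m variables
    print(f"m={m}: R(2^m, m+1) = {regions:>12,}  vs  2^(2^m) = {total_bf:,}")
```

Already at m = 3 the gap is clear (128 vs 256), and it widens rapidly with m.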
PTA convergence
Statement of Convergence of
PTA
 Statement:
Whatever be the initial choice of weights and
whatever be the vector chosen for testing, PTA
converges if the vectors are from a linearly
separable function.
Proof of Convergence of PTA
 Suppose wn is the weight vector at the nth step of the algorithm.
 At the beginning, the weight vector is w0.
 Go from wi to wi+1 when a vector Xj fails the test wi·Xj > 0, and update wi as
wi+1 = wi + Xj
 Since the Xj's come from a linearly separable function,
∃ w* s.t. w*·Xj > 0 ∀ j
Proof of Convergence of PTA
(cntd.)
 Consider the expression
G(wn) = (wn · w*) / |wn|
where wn = weight at the nth iteration
 G(wn) = (|wn| · |w*| · cos α) / |wn|, where α = angle between wn and w*
 G(wn) = |w*| · cos α
 G(wn) ≤ |w*|   (as -1 ≤ cos α ≤ 1)
Behavior of Numerator of G
wn · w* = (wn-1 + X^fail_{n-1}) · w*
        = wn-1 · w* + X^fail_{n-1} · w*
        = (wn-2 + X^fail_{n-2}) · w* + X^fail_{n-1} · w* = …
        = w0 · w* + (X^fail_0 + X^fail_1 + … + X^fail_{n-1}) · w*
w* · X^fail_i is always positive: note carefully.
 Suppose w* · Xj ≥ δ for all j, where δ > 0 is the minimum of these dot products.
 Then the numerator of G ≥ w0 · w* + n·δ
 So, the numerator of G grows with n.
Behavior of Denominator of G
 |wn| = √(wn · wn)
      = √( (wn-1 + X^fail_{n-1}) · (wn-1 + X^fail_{n-1}) )
      = √( |wn-1|² + 2·wn-1·X^fail_{n-1} + |X^fail_{n-1}|² )
      ≤ √( |wn-1|² + |X^fail_{n-1}|² )   (as wn-1·X^fail_{n-1} ≤ 0, since the vector failed the test)
      ≤ √( |w0|² + |X^fail_0|² + |X^fail_1|² + … + |X^fail_{n-1}|² )
 Suppose |Xj| ≤ β for all j (the maximum magnitude).
 So, Denominator ≤ √( |w0|² + n·β² )
Some Observations
 Numerator of G grows as n
 Denominator of G grows as √n
=> Numerator grows faster than denominator
 If PTA does not terminate, G(wn) values will become unbounded.
Some Observations contd.
 But, as |G(wn)| ≤ |w*| which is finite,
this is impossible!
 Hence, PTA has to converge.
 Proof is due to Marvin Minsky.
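An illustrative simulation (my own) of G(wn) during PTA on the augmented AND data used earlier, with w* taken as a known separating vector: G(wn) stays below |w*| at every update, and the algorithm terminates.

```python
import numpy as np

# Linearly separable (augmented, 0-class-negated) vectors for AND, as before.
X = np.array([(0, 0, 1), (0, -1, 1), (-1, 0, 1), (1, 1, -1)], dtype=float)
w_star = np.array([3.0, 2.0, 4.0])        # one separating vector: w*.x > 0 for all x
assert (X @ w_star > 0).all()

w = np.zeros(3)
for step in range(1, 100):
    failed = [x for x in X if np.dot(w, x) <= 0]
    if not failed:
        print("converged at step", step, "w =", w)
        break
    w = w + failed[0]
    G = np.dot(w, w_star) / np.linalg.norm(w)
    print(f"step {step}: G(w_n) = {G:.3f}  (bound |w*| = {np.linalg.norm(w_star):.3f})")
```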
Feedforward Network and
Backpropagation
Example - XOR
Figure: a two-layer network of threshold neurons computing XOR. The output neuron has θ = 0.5 and weights w1 = 1, w2 = 1 from two hidden units computing x1·(NOT x2) and (NOT x1)·x2, which are built from the inputs x1 and x2 (the figure shows connection values 1.5 and -1).
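The hidden-layer weights in the figure are hard to recover from the extracted text, so this sketch uses one standard wiring (my own choice): each hidden threshold unit detects one of the two "exactly one input is on" cases, and the output unit ORs them with θ = 0.5 and unit weights as on the slide.

```python
def step(net, theta):
    """Threshold unit: fire iff net input exceeds theta."""
    return 1 if net > theta else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 - 1 * x2, 0.5)    # computes x1 AND (NOT x2)
    h2 = step(-1 * x1 + 1 * x2, 0.5)   # computes (NOT x1) AND x2
    return step(1 * h1 + 1 * h2, 0.5)  # output neuron: theta = 0.5, w1 = w2 = 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(*x))              # 0, 1, 1, 0
```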
Gradient Descent Technique
 Let E be the error at the output layer:
E = (1/2) Σ_{j=1}^{p} Σ_{i=1}^{n} (ti - oi)j²
 ti = target output; oi = observed output
 i is the index going over the n neurons in the outermost layer
 j is the index going over the p patterns (1 to p)
 Ex: XOR: p = 4 and n = 1
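A direct transcription of this error for the XOR case (p = 4 patterns, n = 1 output neuron); the observed outputs below are made-up numbers just to exercise the formula.

```python
# E = 1/2 * sum over patterns j and output neurons i of (t_i - o_i)_j^2
targets  = [0.0, 1.0, 1.0, 0.0]          # XOR targets, one output neuron, 4 patterns
observed = [0.1, 0.8, 0.7, 0.2]          # made-up network outputs

E = 0.5 * sum((t - o) ** 2 for t, o in zip(targets, observed))
print(E)                                  # 0.5 * (0.01 + 0.04 + 0.09 + 0.04) = 0.09
```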
Weights in a FF NN
 wmn is the weight of the connection from the nth neuron to the mth neuron
 The E-vs-W surface is a complex surface in the space defined by the weights wij
 -∂E/∂wmn gives the direction in which a movement of the operating point in the wmn coordinate space will result in the maximum decrease in error:
Δwmn ∝ -∂E/∂wmn
Backpropagation algorithm
Figure: a layered feedforward network with an input layer (n i/p neurons), hidden layers, and an output layer (m o/p neurons); wji is the weight of the connection from neuron i to neuron j in the next layer.
 Fully connected feed forward network
 Pure FF network (no jumping of connections over layers)
Gradient Descent Equations
Δwji = -η·∂E/∂wji        (η = learning rate, 0 ≤ η ≤ 1)
∂E/∂wji = (∂E/∂netj)·(∂netj/∂wji)        (netj = input at the jth neuron)
Define δj = -∂E/∂netj
Then Δwji = η·δj·(∂netj/∂wji) = η·δj·oi
Backpropagation – for
outermost layer
E E  o j
j     (net j  input at the j layer)
th

net j o j net j
1 m
E   (t p  o p ) 2
2 p 1
Hence, j  ((t j  o j )o j (1  o j ))
w ji   (t j  o j )o j (1  o j )oi
Backpropagation for hidden
layers
Figure: δk at the output layer (m o/p neurons, index k) is propagated backwards through the hidden layers (index j) towards the input layer (n i/p neurons, index i) to find the value of δj.
Backpropagation – for hidden
layers
w  jo
ji i

E E o j
j    
net j o j net j
E
  o j (1  o j )
o j
E netk
 
knext layer
(
netk

o j
)  o j (1  o j )

Hence,  j    (
knext layer
k  wkj )  o j (1  o j )

  (w 
knext layer
kj k )o j (1  o j )
General Backpropagation Rule
• General weight updating rule:
Δwji = η·δj·oi
• Where
δj = (tj - oj)·oj·(1 - oj)                          for the outermost layer
   = [ Σ_{k ∈ next layer} wkj·δk ]·oj·(1 - oj)      for hidden layers
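A compact sketch (my own) of these update rules for a 2-2-1 sigmoid network trained on XOR; layer sizes, learning rate and initialisation are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR training patterns (p = 4), one output neuron (n = 1).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# 2-2-1 network; biases play the role of (negative) thresholds.
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)     # input  -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)     # hidden -> output
eta = 0.5                                         # learning rate

def total_error():
    O = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return 0.5 * np.sum((T - O) ** 2)             # E = 1/2 sum_j sum_i (t - o)^2

print("error before training:", round(total_error(), 4))
for epoch in range(20000):
    for x, t in zip(X, T):
        o_h = sigmoid(x @ W1 + b1)                         # forward: hidden outputs o_j
        o_k = sigmoid(o_h @ W2 + b2)                       # forward: output o_k
        delta_k = (t - o_k) * o_k * (1 - o_k)              # outermost-layer delta
        delta_j = (W2 @ delta_k) * o_h * (1 - o_h)         # hidden-layer delta
        W2 += eta * np.outer(o_h, delta_k); b2 += eta * delta_k   # dw_ji = eta*delta_j*o_i
        W1 += eta * np.outer(x, delta_j);   b1 += eta * delta_j
print("error after training:", round(total_error(), 4))
# The error should drop close to 0; XOR runs occasionally stall in a local
# minimum, in which case another random seed usually works.
```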
How does it work?
 Input propagation forward and error propagation backward (e.g. XOR)
Figure: the XOR network from the earlier slide (output neuron with θ = 0.5, w1 = w2 = 1, fed by the two hidden units).
Can Linear Neurons Work?
y  m xc 3 3

h2 h1

y  m xc
2 2 y  m xc 1 1

x2 x1

h  m (w x  w x )  c
1 1 1 1 2 2 1

h  m (w x  w x )  c
1 1 1 1 2 2 1

Out  (w h  w h )  c5 1 6 2 3

k x  k x  k1 1 2 2 3
Note: The whole structure shown in earlier slide is reducible
to a single neuron with given behavior

Out  k x  k x  k
1 1 2 2 3

Claim: A neuron with linear I-O behavior can't compute XOR.

Proof: Consider all possible cases [assuming 0.1 and 0.9 as the lower and upper thresholds]:

For (0,0), zero class:  m·(w1·0 + w2·0 - θ) + c ≤ 0.1,  i.e.  c - m·θ ≤ 0.1
For (0,1), one class:   m·(w2·1 + w1·0 - θ) + c ≥ 0.9,  i.e.  m·w2 - m·θ + c ≥ 0.9
For (1,0), one class:   m·w1 - m·θ + c ≥ 0.9
For (1,1), zero class:  m·w1 + m·w2 - m·θ + c ≤ 0.1

These inequalities are inconsistent: adding the middle two gives m·w1 + m·w2 - 2·m·θ + 2c ≥ 1.8, while adding the first and last gives m·w1 + m·w2 - 2·m·θ + 2c ≤ 0.2. Hence XOR can't be computed.

Observations:
1. A linear neuron can't compute XOR.
2. A multilayer FFN with linear neurons is collapsible to a single linear neuron, hence there is no additional power due to the hidden layer.
3. Non-linearity is essential for power.
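A tiny numerical illustration (my own, with random matrices) of observation 2: composing two linear layers yields a single linear map.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two linear layers: h = A x + a, out = B h + b  (no non-linearity anywhere).
A, a = rng.normal(size=(2, 2)), rng.normal(size=2)
B, b = rng.normal(size=(1, 2)), rng.normal(size=1)

# The composition collapses to a single linear neuron: out = K x + c.
K, c = B @ A, B @ a + b

x = rng.normal(size=2)
two_layer = B @ (A @ x + a) + b
one_layer = K @ x + c
print(np.allclose(two_layer, one_layer))   # True
```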
An application in Medical
Domain
Expert System for Skin Diseases
Diagnosis
 Bumpiness and scaliness of skin
 Mostly for symptom gathering and for
developing diagnosis skills
 Not replacing doctor’s diagnosis
Architecture of the FF NN
 96-20-10
 96 input neurons, 20 hidden layer neurons,
10 output neurons
 Inputs: skin disease symptoms and their parameters
 Location, distribution, shape, arrangement, pattern, number of lesions, presence of an active border, amount of scale, elevation of papules, color, altered pigmentation, itching, pustules, lymphadenopathy, palmar thickening, results of microscopic examination, presence of herald patch, result of the dermatology test called KOH
Output
 10 neurons indicative of the diseases:
 psoriasis, pityriasis rubra pilaris, lichen planus, pityriasis rosea, tinea versicolor, dermatophytosis, cutaneous T-cell lymphoma, secondary syphilis, chronic contact dermatitis, seborrheic dermatitis
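A hypothetical sketch of the 96-20-10 feedforward network described above; the weights are random placeholders, not the trained DESKNET weights, and the symptom encoding is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 96 symptom inputs -> 20 hidden neurons -> 10 disease outputs (random placeholders).
W_hidden = rng.normal(scale=0.1, size=(96, 20))
W_output = rng.normal(scale=0.1, size=(20, 10))

symptoms = np.full(96, 0.5)                       # 0.5 encodes "value not known"
symptoms[:5] = [1.0, 0.0, 1.0, 0.8, 0.2]          # a few made-up known symptom values

hidden = sigmoid(symptoms @ W_hidden)
diseases = sigmoid(hidden @ W_output)
print("most activated disease node:", int(np.argmax(diseases)))
```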
Figure: Explanation of dermatophytosis diagnosis using the DESKNET expert system. Symptom inputs and their internal representations (e.g., duration of lesions in weeks, minimal itching, positive KOH test, lesions located on feet, minimal increase in pigmentation, positive test for pseudohyphae and spores) feed the 96 input nodes; 20 hidden nodes plus bias nodes connect to the disease-diagnosis output nodes (psoriasis, dermatophytosis and seborrheic dermatitis are shown), with the dermatophytosis node receiving the highest activation.
Training data
 Input specs of 10 model diseases from 250 patients
 0.5 if some specific symptom value is not known
 Trained using standard error backpropagation algorithm
Testing
 Previously unused symptom and disease data of 99
patients
 Result:
 Correct diagnosis achieved for 70% of
papulosquamous group skin diseases
 Success rate above 80% for the remaining diseases
except for psoriasis
 psoriasis diagnosed correctly only in 30% of the
cases
 Psoriasis resembles other diseases within the
papulosquamous group of diseases, and is somewhat
difficult even for specialists to recognise.
Explanation capability
 Rule-based systems reveal the explicit path of reasoning through textual statements
 Connectionist expert systems reach conclusions through complex, non-linear and simultaneous interactions of many units
 Analysing the effect of a single input or a single group of inputs would be difficult and would yield incorrect results
Explanation contd.
 The hidden layer re-represents the data
 Outputs of hidden neurons are neither symptoms nor decisions
Discussion
 Symptoms and parameters contributing to the diagnosis are found from the n/w
 Standard deviation, mean and other tests of significance are used to arrive at the importance of the contributing parameters
 The n/w acts as an apprentice to the expert
