CS217: Artificial Intelligence and
Machine Learning
(associated lab: CS240)
Pushpak Bhattacharyya
CSE Dept.,
IIT Bombay
Week 4 of 27 Jan 2025: Perceptron capacity,
FFNN
Main points covered: Week 3 of
20 Jan 2025
Monotonicity
A heuristic h(p) is said to satisfy the
monotone restriction if, for all p,
h(p) ≤ h(pc) + cost(p, pc), where pc is
any child of p.
Theorem
If monotone restriction (also called triangular
inequality) is satisfied, then for nodes in the
closed list, redirection of parent pointer is not
necessary. In other words, if any node ‘n’ is
chosen for expansion from the open list, then
g(n)=g*(n), where g(n) is the cost of the
path from the start node ‘s’ to ‘n’ at that point
of the search when ‘n’ is chosen, and g*(n) is
the cost of the optimal path from ‘s’ to ‘n’
Relationship between
Monotonicity and Admissibility
Observation:
Monotone Restriction → Admissibility
but not vice-versa
Statement: If h(ni) ≤ h(nj) + c(ni, nj)
for every node ni and each of its children nj,
then h(ni) ≤ h*(ni) for all i
The Perceptron Model
A perceptron is a computing element with
input lines having associated weights and the
cell having a threshold value. The perceptron
model is motivated by the biological neuron.
[Figure: a perceptron with inputs x1, …, xn, input weights w1, …, wn, threshold θ and output y]
Perceptron Training Algorithm
1. Start with a random value of w,
e.g., <0,0,0,…>
2. Test for w·xi > 0
If the test succeeds for i = 1, 2, …, n
then return w
3. Modify w: wnext = wprev + xfail, and go to step 2
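Below is a minimal Python sketch of the algorithm above. It assumes (this is not stated on the slide) that each training vector has already been augmented with the threshold term and that 0-class vectors are negated, so a correct w gives w · x > 0 for every vector; the AND example at the end uses that preprocessing.

# Minimal sketch of the Perceptron Training Algorithm above.
import numpy as np

def pta(vectors, max_steps=10000):
    X = np.asarray(vectors, dtype=float)
    w = np.zeros(X.shape[1])                           # step 1: w = <0,0,...,0>
    for _ in range(max_steps):
        failed = [x for x in X if np.dot(w, x) <= 0]   # step 2: test w.x > 0
        if not failed:
            return w                                   # all tests succeed: return w
        w = w + failed[0]                              # step 3: w_next = w_prev + x_fail
    raise RuntimeError("did not converge; data may not be linearly separable")

# AND function, vectors written as <x1, x2, -1> with 0-class vectors negated.
print(pta([(1, 1, -1), (0, 0, 1), (0, -1, 1), (-1, 0, 1)]))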
Convergence of PTA
Statement:
Whatever be the initial choice of weights and
whatever be the vector chosen for testing, PTA
converges if the vectors are from a linearly
separable function.
End of main points
Perceptrons and their computing
power
Fundamental Observation
The number of TFs (threshold functions) computable by a perceptron
is equal to the number of regions produced by
the 2^n hyperplanes obtained by plugging the
values <x1, x2, x3, …, xn> into the equation
Σ_{i=1}^{n} wi·xi = θ
The geometrical observation
Problem: given m linear surfaces called hyperplanes
(each hyperplane is of (d-1)-dim) in d-dim space,
what is the max. no. of regions produced by their
intersection? i.e., R(m, d) = ?
Regions produced by lines
[Figure: four lines L1-L4 in the (X1, X2) plane, not necessarily passing through the origin]
L1: 2
L2: 2 + 2 = 4
L3: 2 + 2 + 3 = 7
L4: 2 + 2 + 3 + 4 = 11
New regions created = number of segments into which the incoming line is cut by the original lines (i.e., the number of intersections on it + 1)
Total number of regions = original number of regions + new regions created
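As a quick check of these counts, here is a small Python sketch that applies the incremental rule: the k-th line (in general position) is cut by the k-1 existing lines into k segments, and each segment adds one region.

def regions_from_lines(m):
    regions = 1                      # the plane with no lines is one region
    for k in range(1, m + 1):
        regions += k                 # the k-th line adds k new regions
    return regions

print([regions_from_lines(m) for m in range(1, 5)])   # [2, 4, 7, 11]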
Number of computable functions by a neuron
[Figure: a neuron with inputs x1, x2, weights w1, w2, threshold θ and output y]
Plugging the four input vectors into w1·x1 + w2·x2 = θ gives:
(0,0): 0 = θ : P1
(0,1): w2 = θ : P2
(1,0): w1 = θ : P3
(1,1): w1 + w2 = θ : P4
P1, P2, P3 and P4 are planes in the <w1, w2, θ> space
Number of computable functions by a neuron (cont…)
P1 produces 2 regions.
P2 is intersected by P1 in a line; 2 more new regions are produced. Number of regions = 2 + 2 = 4.
P3 is intersected by P1 and P2 in 2 intersecting lines; 4 more regions are produced. Number of regions = 4 + 4 = 8.
P4 is intersected by P1, P2 and P3 in 3 intersecting lines; 6 more regions are produced. Number of regions = 8 + 6 = 14.
Thus, a single neuron can compute 14 Boolean functions, which are the linearly separable ones.
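The count of 14 can be checked empirically with the short Python sketch below: it samples random (w1, w2, θ) triples and records which 2-input Boolean functions the test w1·x1 + w2·x2 > θ realises. Random sampling is an assumption made for simplicity, not an exhaustive proof, but it reliably recovers all 14 linearly separable functions (everything except XOR and XNOR).

import itertools, random

inputs = list(itertools.product([0, 1], repeat=2))
found = set()
random.seed(0)
for _ in range(100000):
    w1, w2, theta = (random.uniform(-2, 2) for _ in range(3))
    f = tuple(int(w1 * x1 + w2 * x2 > theta) for x1, x2 in inputs)
    found.add(f)
print(len(found))   # 14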
Points in the same region
If <w1, w2, θ> and <w1', w2', θ'> share a region, then for every input <x1, x2>
w1·x1 + w2·x2 > θ holds exactly when w1'·x1 + w2'·x2 > θ'
i.e., the two weight-threshold settings compute the same function.
No. of Regions produced by Hyperplanes
The number of regions produced by n hyperplanes in d-dim passing
through the origin is given by the recurrence relation
R(n, d) = R(n-1, d) + R(n-1, d-1)
which we solve using a generating function.
Boundary conditions:
R(1, d) = 2    (1 hyperplane in d-dim)
R(n, 1) = 2    (n hyperplanes in 1-dim all reduce to the point at the origin)
The generating function is
f(x, y) = Σ_{n≥1} Σ_{d≥1} R(n, d) x^n y^d
From the recurrence relation we have, for n ≥ 2 and d ≥ 2,
R(n, d) - R(n-1, d) - R(n-1, d-1) = 0
R(n-1, d) corresponds to 'shifting' n by 1 place => multiplication by x
R(n-1, d-1) corresponds to 'shifting' n and d by 1 place => multiplication by xy
On expanding f(x, y) we get
f(x, y) = R(1,1)xy + R(1,2)xy^2 + R(1,3)xy^3 + ... + R(1,d)xy^d + ...
        + R(2,1)x^2y + R(2,2)x^2y^2 + R(2,3)x^2y^3 + ... + R(2,d)x^2y^d + ...
        + ...
        + R(n,1)x^ny + R(n,2)x^ny^2 + R(n,3)x^ny^3 + ... + R(n,d)x^ny^d + ...
Multiplying by x and by xy shifts the indices:
x·f(x, y) = Σ_{n≥2} Σ_{d≥1} R(n-1, d) x^n y^d
          = Σ_{n≥2} Σ_{d≥2} R(n-1, d) x^n y^d + 2 Σ_{n≥2} x^n y    (using R(n-1, 1) = 2)
xy·f(x, y) = Σ_{n≥2} Σ_{d≥2} R(n-1, d-1) x^n y^d
Separating the n = 1 and d = 1 terms of f(x, y) and using R(1, d) = R(n, 1) = 2,
f(x, y) = Σ_{n≥1} Σ_{d≥1} R(n, d) x^n y^d
        = Σ_{n≥2} Σ_{d≥2} R(n, d) x^n y^d + Σ_{d≥2} R(1, d) x y^d + Σ_{n≥2} R(n, 1) x^n y + R(1,1) xy
        = Σ_{n≥2} Σ_{d≥2} R(n, d) x^n y^d + 2x Σ_{d≥2} y^d + 2y Σ_{n≥2} x^n + 2xy
After all this expansion,
f(x, y) - x·f(x, y) - xy·f(x, y)
  = Σ_{n≥2} Σ_{d≥2} [R(n, d) - R(n-1, d) - R(n-1, d-1)] x^n y^d
    + 2x Σ_{d≥2} y^d + 2xy + 2y Σ_{n≥2} x^n - 2y Σ_{n≥2} x^n
  = 2x Σ_{d≥1} y^d,   since the other two terms become zero
This implies
[1 - x - xy] f(x, y) = 2x Σ_{d≥1} y^d
f(x, y) = 1/[1 - x(1+y)] · 2x Σ_{d≥1} y^d
        = 2x [y + y^2 + y^3 + ... + y^d + ...] · [1 + x(1+y) + x^2(1+y)^2 + ... + x^n(1+y)^n + ...]
Also we have
f(x, y) = Σ_{n≥1} Σ_{d≥1} R(n, d) x^n y^d
Comparing the coefficients of x^n y^d on both sides we get
R(n, d) = 2 Σ_{i=0}^{d-1} C(n-1, i)
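The closed form can be sanity-checked against the recurrence and its boundary conditions with a short Python sketch (the ranges of n and d below are arbitrary choices):

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def R_rec(n, d):
    if n == 1 or d == 1:       # boundary conditions R(1, d) = R(n, 1) = 2
        return 2
    return R_rec(n - 1, d) + R_rec(n - 1, d - 1)

def R_closed(n, d):
    return 2 * sum(comb(n - 1, i) for i in range(d))

assert all(R_rec(n, d) == R_closed(n, d)
           for n in range(1, 9) for d in range(1, 7))
print(R_closed(4, 3))   # 4 planes through the origin in 3-dim: 14 regions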
Implication
For a perceptron with m weights and 1 threshold, the 2^m Boolean input
vectors give n = 2^m hyperplanes in the (m+1)-dimensional <w1, …, wm, θ>
space, so R(n, d) becomes R(2^m, m+1):
R(2^m, m+1) = 2 Σ_{i=0}^{m} C(2^m - 1, i) = O(2^(m^2))
The total number of Boolean functions of m variables is 2^(2^m). This shows
why #TF << #BF
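A few small values of m, computed with the sketch below, make the gap concrete (the range of m is an arbitrary choice):

from math import comb

def R(n, d):
    return 2 * sum(comb(n - 1, i) for i in range(d))

for m in range(1, 6):
    print(m, R(2 ** m, m + 1), 2 ** (2 ** m))
# m=1: 4 vs 4;  m=2: 14 vs 16;  m=3: 128 vs 256;
# m=4: 3882 vs 65536;  m=5: 412736 vs 4294967296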
PTA convergence
Statement of Convergence of
PTA
Statement:
Whatever be the initial choice of weights and
whatever be the vector chosen for testing, PTA
converges if the vectors are from a linearly
separable function.
Proof of Convergence of PTA
Suppose wn is the weight vector at the nth
step of the algorithm.
At the beginning, the weight vector is w0
Go from wi to wi+1 when a vector Xj fails
the test wi · Xj > 0, and update wi as
wi+1 = wi + Xj
Since the Xj's form a linearly separable
function,
∃ w* s.t. w* · Xj > 0 ∀ j
Proof of Convergence of PTA
(contd.)
Consider the expression
G(wn) = (wn · w*) / |wn|
where wn = weight at the nth iteration
G(wn) = (|wn| · |w*| · cos θ) / |wn|
where θ = angle between wn and w*
G(wn) = |w*| · cos θ
G(wn) ≤ |w*|   (as -1 ≤ cos θ ≤ 1)
Behavior of Numerator of G
wn · w* = (wn-1 + X_{n-1}^{fail}) · w*
        = wn-1 · w* + X_{n-1}^{fail} · w*
        = (wn-2 + X_{n-2}^{fail}) · w* + X_{n-1}^{fail} · w* = ...
        = w0 · w* + (X_0^{fail} + X_1^{fail} + ... + X_{n-1}^{fail}) · w*
w* · X_i^{fail} is always positive: note carefully.
Suppose |Xj| ≥ δ, where δ is the minimum magnitude.
Numerator of G ≥ |w0 · w*| + n·δ·|w*|
So, the numerator of G grows with n.
Behavior of Denominator of G
|wn|^2 = wn · wn
       = (wn-1 + X_{n-1}^{fail})^2
       = (wn-1)^2 + 2·wn-1·X_{n-1}^{fail} + (X_{n-1}^{fail})^2
       ≤ (wn-1)^2 + (X_{n-1}^{fail})^2    (as wn-1 · X_{n-1}^{fail} ≤ 0)
       ≤ (w0)^2 + (X_0^{fail})^2 + (X_1^{fail})^2 + ... + (X_{n-1}^{fail})^2
Suppose |Xj| ≤ δmax (the maximum magnitude).
So, Denominator = |wn| ≤ √((w0)^2 + n·δmax^2)
Some Observations
Numerator of G grows as n
Denominator of G grows as √n
=> Numerator grows faster than
denominator
If PTA does not terminate, G(wn) values
will become unbounded.
Some Observations contd.
But, as |G(wn)| ≤ |w*| which is finite,
this is impossible!
Hence, PTA has to converge.
Proof is due to Marvin Minsky.
Feedforward Network and
Backpropagation
Example - XOR
[Figure: a feedforward network computing XOR. The output neuron has threshold θ = 0.5 and receives weights w1 = 1 and w2 = 1 from two hidden neurons computing x1·x̄2 and x̄1·x2; the connections from the inputs x1 and x2 to the hidden neurons carry weights 1.5 and -1.]
Gradient Descent Technique
Let E be the error at the output layer
E = (1/2) Σ_{j=1}^{p} Σ_{i=1}^{n} (t_i - o_i)_j^2
ti = target output; oi = observed output
i is the index going over the n neurons in the
outermost layer
j is the index going over the p patterns (1 to p)
Ex: XOR: p = 4 and n = 1
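For concreteness, the sketch below evaluates E for the XOR case (p = 4 patterns, n = 1 output neuron); the observed outputs are made-up values for illustration.

import numpy as np

targets  = np.array([[0.0], [1.0], [1.0], [0.0]])   # t, one row per pattern
observed = np.array([[0.2], [0.7], [0.6], [0.4]])   # o, hypothetical outputs
E = 0.5 * np.sum((targets - observed) ** 2)
print(E)   # 0.225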
Weights in a FF NN
wmn is the weight of the connection from the nth neuron
to the mth neuron.
[Figure: neuron n connected to neuron m through weight wmn]
The E vs. W surface is a complex surface in the space
defined by the weights wij.
-∂E/∂wmn gives the direction in which a movement of the
operating point in the wmn co-ordinate space will result in
maximum decrease in error:
Δwmn ∝ -∂E/∂wmn
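The idea of moving the operating point against the gradient can be seen on a toy one-weight error surface (E(w) = (w - 3)^2 is an assumed example, not the network's actual E):

eta = 0.1                      # learning rate
w = 0.0                        # initial operating point
for _ in range(50):
    grad = 2 * (w - 3)         # dE/dw on the toy surface
    w = w - eta * grad         # delta_w proportional to -dE/dw
print(round(w, 3))             # approaches 3, the minimum of E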
Backpropagation algorithm
[Figure: a fully connected feedforward network: an output layer of m o/p neurons (indexed by j), hidden layers (indexed by i), and an input layer of n i/p neurons; wji is the weight on the connection from neuron i to neuron j.]
Fully connected feed forward network
Pure FF network (no jumping of connections over layers)
Gradient Descent Equations
Δwji = -η ∂E/∂wji    (η = learning rate, 0 ≤ η ≤ 1)
∂E/∂wji = (∂E/∂netj) · (∂netj/∂wji)    (netj = input at the jth neuron)
Define δj = -∂E/∂netj
Δwji = η δj · (∂netj/∂wji) = η δj oi
Backpropagation – for
outermost layer
δj = -∂E/∂netj = -(∂E/∂oj) · (∂oj/∂netj)    (netj = input at the jth layer)
E = (1/2) Σ_{p=1}^{m} (t_p - o_p)^2
Hence, δj = (t_j - o_j) · o_j · (1 - o_j)
Δwji = η (t_j - o_j) o_j (1 - o_j) o_i
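A small sketch of this output-layer update, assuming sigmoid units (so ∂oj/∂netj = oj(1 - oj)) and illustrative values for the learning rate and the outputs:

import numpy as np

eta = 0.5                                   # assumed learning rate
t = np.array([1.0, 0.0])                    # targets t_j (2 output neurons)
o = np.array([0.7, 0.3])                    # observed outputs o_j
o_prev = np.array([0.6, 0.9, 0.1])          # outputs o_i of the previous layer

delta = (t - o) * o * (1 - o)               # delta_j for the outermost layer
delta_w = eta * np.outer(delta, o_prev)     # delta_w_ji = eta * delta_j * o_i
print(delta)                                # [ 0.063 -0.063]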
Backpropagation for hidden
layers
[Figure: the feedforward network again, with k indexing the output layer (m o/p neurons), j a hidden layer, i the layer below it, and an input layer of n i/p neurons.]
δk is propagated backwards to find the value of δj.
Backpropagation – for hidden
layers
Δwji = η δj oi
δj = -∂E/∂netj = -(∂E/∂oj) · (∂oj/∂netj)
   = -(∂E/∂oj) · oj(1 - oj)
   = -[ Σ_{k ∈ next layer} (∂E/∂netk) · (∂netk/∂oj) ] · oj(1 - oj)
Hence, δj = [ Σ_{k ∈ next layer} δk wkj ] · oj(1 - oj)
          = ( Σ_{k ∈ next layer} wkj δk ) · oj(1 - oj)
General Backpropagation Rule
• General weight updating rule:
  Δwji = η δj oi
• Where
  δj = (tj - oj) oj (1 - oj)    for the outermost layer
     = ( Σ_{k ∈ next layer} wkj δk ) oj (1 - oj)    for hidden layers
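Putting the two cases together, here is a compact Python/numpy sketch that trains a 2-4-1 sigmoid network on XOR with the rule above. The layer sizes, learning rate, epoch count and use of bias terms (instead of explicit thresholds) are illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)      # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)      # hidden -> output
eta = 0.5
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    h = sigmoid(X @ W1 + b1)                          # forward pass: hidden o_i
    o = sigmoid(h @ W2 + b2)                          # forward pass: output o_j
    d_out = (T - o) * o * (1 - o)                     # delta_j, outermost layer
    d_hid = (d_out @ W2.T) * h * (1 - h)              # delta_j, hidden layer
    W2 += eta * h.T @ d_out; b2 += eta * d_out.sum(axis=0)   # w += eta*delta*o
    W1 += eta * X.T @ d_hid; b1 += eta * d_hid.sum(axis=0)

print(np.round(o.ravel(), 2))   # typically close to [0, 1, 1, 0]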
How does it work?
Input propagation forward and error
propagation backward (e.g. XOR)
[Figure: the same XOR network as before: output neuron with θ = 0.5 and weights w1 = 1, w2 = 1 from the two hidden neurons computing x1·x̄2 and x̄1·x2, with input connection weights 1.5 and -1.]
Can Linear Neurons Work?
[Figure: a two-layer network of linear neurons with inputs x1, x2, hidden neurons h1, h2 and one output; the I/O behaviours of the three neurons are y = m1·x + c1, y = m2·x + c2 and y = m3·x + c3.]
h1 = m1(w1·x1 + w2·x2) + c1
h2 = m2(w3·x1 + w4·x2) + c2
Out = m3(w5·h1 + w6·h2) + c3
    = k1·x1 + k2·x2 + k3
Note: the whole structure shown in the earlier slide is reducible
to a single neuron with the behaviour
Out = k1·x1 + k2·x2 + k3
Claim: A neuron with linear I-O behavior can’t compute X-
OR.
Proof: Consider all possible cases
[assuming 0.1 and 0.9 as the lower and upper thresholds]:
For (0,0), Zero class: m(w1·0 + w2·0) + c ≤ 0.1, i.e., c ≤ 0.1
For (0,1), One class: m(w1·0 + w2·1) + c ≥ 0.9, i.e., m·w2 + c ≥ 0.9
For (1,0), One class: m(w1·1 + w2·0) + c ≥ 0.9, i.e., m·w1 + c ≥ 0.9
For (1,1), Zero class: m(w1·1 + w2·1) + c ≤ 0.1, i.e., m·w1 + m·w2 + c ≤ 0.1
These inequalities are inconsistent: adding the two One-class inequalities gives
m·w1 + m·w2 + 2c ≥ 1.8, while adding the two Zero-class ones gives
m·w1 + m·w2 + 2c ≤ 0.2. Hence X-OR can't be computed.
Observations:
1. A linear neuron can’t compute X-OR.
2. A multilayer FFN with linear neurons is collapsible to a
single linear neuron, hence no additional power is gained
from the hidden layers.
3. Non-linearity is essential for power.
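Observation 2 can be checked numerically: the sketch below composes two layers of linear neurons with arbitrary illustrative weights and shows that the result equals a single linear neuron.

import numpy as np

W1 = np.array([[0.5, -1.0], [2.0, 0.3]]); c1 = np.array([0.1, -0.2])  # hidden
W2 = np.array([[1.5, -0.7]]);             c2 = np.array([0.4])        # output

x = np.array([0.3, 0.8])
two_layer = W2 @ (W1 @ x + c1) + c2        # out = W2 (W1 x + c1) + c2
K, k0 = W2 @ W1, W2 @ c1 + c2              # collapsed single linear neuron
print(two_layer, K @ x + k0)               # identical outputs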
An application in Medical
Domain
Expert System for Skin Diseases
Diagnosis
Bumpiness and scaliness of skin
Mostly for symptom gathering and for
developing diagnosis skills
Not replacing doctor’s diagnosis
Architecture of the FF NN
96-20-10
96 input neurons, 20 hidden layer neurons,
10 output neurons
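As a rough sketch of what a 96-20-10 network looks like in code (random placeholder weights and sigmoid units are assumptions for illustration; none of this comes from the actual DESKNET system):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1 = rng.normal(0, 0.1, (96, 20)); b1 = np.zeros(20)   # 96 inputs -> 20 hidden
W2 = rng.normal(0, 0.1, (20, 10)); b2 = np.zeros(10)   # 20 hidden -> 10 outputs

symptoms = rng.random(96)                   # encoded symptom parameters
hidden = sigmoid(symptoms @ W1 + b1)
disease_scores = sigmoid(hidden @ W2 + b2)  # one score per candidate disease
print(disease_scores.shape)                 # (10,)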
Inputs: skin disease symptoms and their
parameters
Location, distribution, shape, arrangement,
pattern, number of lesions, presence of an active
border, amount of scale, elevation of papules,
color, altered pigmentation, itching, pustules,
lymphadenopathy, palmar thickening, results of
microscopic examination, presence of herald
patch, result of the dermatology test called KOH
Output
10 neurons indicative of the diseases:
psoriasis, pityriasis rubra pilaris, lichen
planus, pityriasis rosea, tinea versicolor,
dermatophytosis, cutaneous T-cell
lymphoma, secondary syphilis, chronic
contact dermatitis, seborrheic dermatitis
Symptoms & parameters Internal
Duration
of lesions : weeks 0 representation
Disease
diagnosis
Duration
0
of lesions : weeks 1
0
Minimal itching
( Psoriasis node )
6
Positive 1.68
KOH test
10
13
5
Lesions located 1.62 (Dermatophytosis node)
on feet
36
14
Minimal
increase
in pigmentation 71
1
Positive test for
9
pseudohyphae
(Seborrheic dermatitis node)
And spores 95 19
Bias Bias
20
96
Figure : Explanation of dermatophytosis diagnosis using the DESKNET expert system.
Training data
Input specs of 10 model diseases from
250 patients
0.5 if some specific symptom value is
not known
Trained using standard error
backpropagation algorithm
Testing
Previously unused symptom and disease data of 99
patients
Result:
Correct diagnosis achieved for 70% of
papulosquamous group skin diseases
Success rate above 80% for the remaining diseases
except for psoriasis
Psoriasis was diagnosed correctly in only 30% of the
cases
Psoriasis resembles other diseases within the
papulosquamous group of diseases, and is somewhat
difficult even for specialists to recognise.
Explanation capability
Rule-based systems reveal the explicit
path of reasoning through textual
statements
Connectionist expert systems reach
conclusions through complex, non-linear
and simultaneous interactions of many
units
Analysing the effect of a single input or a
single group of inputs would be difficult
and would yield incorrect results
Explanation contd.
The hidden layer re-represents the data
Outputs of hidden neurons are neither
symptoms nor decisions
Discussion
Symptoms and parameters contributing
to the diagnosis are found from the n/w
Standard deviation, mean and other
tests of significance used to arrive at
the importance of contributing
parameters
The n/w acts as an apprentice to the
expert