Introduction To Neural Networks: Deep Learning For NLP
Kevin Patel
ICON 2017
1 Motivation
2 Perceptron
3 Feed Forward Neural Networks
4 Deep Learning
5 Conclusion
Perceptron
Perceptron (Contd.)
[Figure: a perceptron with inputs x1, x2, x3, weights w1, w2, w3, and output y]
y = 1, if ∑_i w_i x_i > threshold
    0, otherwise
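A minimal Python sketch of this rule; the weights and threshold below are illustrative, not from the slides:

```python
# A perceptron as defined above: output 1 if the weighted sum of the
# inputs exceeds the threshold, otherwise 0.
def perceptron(x, w, threshold):
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total > threshold else 0

# Example: two binary inputs, equal weights, threshold 1 -> behaves like AND.
print(perceptron([1, 1], [1, 1], threshold=1))  # 1
print(perceptron([1, 0], [1, 1], threshold=1))  # 0
```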
Some Conventions
Bias
Perceptron Example
Should I go to lab?
x1: My guide is here (weight 10)
x2: Collaborators are in lab (weight 2)
x3: The buses are running (weight 2)
x4: Tasty tiffin in the mess (weight −3)
b: My inclination towards going to the lab, no matter what
[Figure: a perceptron with inputs x1, x2, x3, x4, weights 10, 2, 2, −3, bias b, and output y]
What if b = −3?
What if b = 1?
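A sketch of this example in Python, using the bias form y = 1 if w·x + b > 0; the weights 10, 2, 2, −3 are taken from the figure, while the input scenario below is made up for illustration:

```python
# The "should I go to lab?" perceptron, written with a bias term:
# y = 1 if w.x + b > 0, else 0. Weights follow the figure: guide = 10,
# collaborators = 2, buses = 2, tasty tiffin = -3.
def go_to_lab(x, b, w=(10, 2, 2, -3)):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Guide absent; collaborators, buses, and tasty tiffin present.
x = (0, 1, 1, 1)
print(go_to_lab(x, b=-3))  # 2 + 2 - 3 - 3 = -2 -> 0: stay home
print(go_to_lab(x, b=1))   # 2 + 2 - 3 + 1 =  2 -> 1: go to lab
```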
Network of Perceptrons
[Figure: a network of perceptrons: inputs x1, x2, x3 feed a hidden layer of perceptrons, whose outputs feed a final perceptron y; each connection carries a weight w^l_{ij}]
[Figure: two perceptrons over inputs x1, x2, both with weights 2, 2, one with bias −1 and one with bias −3]
NAND Gate
NOT Gate
[Figure: a NAND gate as a perceptron with weights −2, −2 and bias 3; a NOT gate as a perceptron with weight −2 and bias 1]
XOR Gate
[Figure: an XOR gate built from NAND perceptrons, each with weights −2, −2 and bias 3]
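A sketch of these gates in Python: the NAND perceptron uses the weights −2, −2 and bias 3 from the figure, and XOR is composed from NAND gates using the standard four-NAND wiring (assumed to match the slide's diagram):

```python
# A NAND gate as a single perceptron (weights -2, -2, bias 3).
def nand(x1, x2):
    return 1 if (-2 * x1) + (-2 * x2) + 3 > 0 else 0

# XOR built entirely out of NAND perceptrons.
def xor(x1, x2):
    a = nand(x1, x2)
    return nand(nand(x1, a), nand(x2, a))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))  # prints 0, 1, 1, 0
```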
Learning
Learning (contd.)
If y is a function of x, then the change in y, i.e. ∆y, is related to the change in x, i.e. ∆x, as follows (linear approximation):
∆y ≈ (dy/dx) ∆x
Example
f(x) = x²
f′(x) = 2x
f(4.01) ≈ f(4) + f′(4)(4.01 − 4)
        = 16 + 2 × 4 × 0.01
        = 16.08
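A quick numerical check of this approximation (illustrative script, not from the slides):

```python
# Linear approximation of f(x) = x^2 near x = 4.
f = lambda x: x ** 2
f_prime = lambda x: 2 * x

approx = f(4) + f_prime(4) * (4.01 - 4)  # 16.08
exact = f(4.01)                          # 16.0801
print(approx, exact)
```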
Learning (contd.)
Sigmoid Neurons
Activation Functions
[Figure: activation functions: Linear, Sigmoid, Tanh, ReLU]
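One way to write these four activations with NumPy (a sketch; the standard definitions are assumed):

```python
import numpy as np

def linear(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```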
Notations
x: input
w^l_{ij}: weight from the j-th neuron in the (l−1)-th layer to the i-th neuron in the l-th layer
b^l_j: bias of the j-th neuron in the l-th layer
z^l_j: w^l_j · a^{l−1} + b^l_j
a^l_j: f(z^l_j)
[Figure: a feedforward network with inputs x1, x2, x3, one hidden layer, and outputs y1, y2; connections labelled with weights w^1_{ij} and w^2_{ij}]
Given input x:
z^1 = w^1 · x + b^1
a^1 = σ(z^1)
z^l = w^l · a^{l−1} + b^l
a^l = σ(z^l)
a^L is the output, where L is the last layer
Note that the output contains real numbers (due to the σ function)
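A minimal NumPy sketch of this forward pass; the layer sizes (3 → 4 → 2) and random weights are illustrative:

```python
import numpy as np

# Forward pass following the equations above:
# z^l = w^l . a^{l-1} + b^l,  a^l = sigma(z^l).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]

def feedforward(x):
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a  # a^L, the output of the last layer: real numbers in (0, 1)

print(feedforward(np.array([1.0, 0.0, 1.0])))
```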
Loss functions
Consider a network with parameter setting P1
True      Predicted        Correct?
0 1 0     0.3 0.4 0.3      yes
1 0 0     0.1 0.2 0.7      no
0 0 1     0.3 0.3 0.4      yes
Number of correctly classified examples = 2/3
Classification error = 1 − 2/3 = 1/3
Consider the same network with parameter setting P2
True      Predicted        Correct?
0 1 0     0.1 0.7 0.2      yes
1 0 0     0.3 0.4 0.3      no
0 0 1     0.1 0.2 0.7      yes
Classification error = 1 − 2/3 = 1/3 again, even though P2's predictions are visibly closer to the targets, so classification error cannot distinguish P1 from P2.
Loss functions
Mean Squared Error: MSE = (1/M) ∑_i (y_i − t_i)²
Loss functions
Mean Cross Entropy: MCE = (1/M) ∑_i (−t_i log y_i − (1 − t_i) log(1 − y_i))
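A sketch that computes classification error, MSE and MCE for the two tables above; the averaging convention used here (sum over the three outputs, mean over the M = 3 examples) is an assumption:

```python
import numpy as np

# Targets and predictions for parameter settings P1 and P2 (from the tables).
T  = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
P1 = np.array([[0.3, 0.4, 0.3], [0.1, 0.2, 0.7], [0.3, 0.3, 0.4]])
P2 = np.array([[0.1, 0.7, 0.2], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])

def classification_error(y, t):
    # An example is correct if the highest-scoring output matches the target class.
    return np.mean(np.argmax(y, axis=1) != np.argmax(t, axis=1))

def mse(y, t):
    # Squared error summed over outputs, averaged over examples.
    return np.mean(np.sum((y - t) ** 2, axis=1))

def mce(y, t):
    # Cross entropy summed over outputs, averaged over examples.
    return np.mean(np.sum(-t * np.log(y) - (1 - t) * np.log(1 - y), axis=1))

for name, y in [("P1", P1), ("P2", P2)]:
    print(name, classification_error(y, T), mse(y, T), mce(y, T))
# Classification error is 1/3 for both settings, but MSE and MCE are lower
# for P2, reflecting that its predictions are closer to the targets.
```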
Minimizing Loss
Consider a function C that depends on some parameter x, as shown below:
[Figure: the curve of C(x) plotted against x]
Gradient Descent
Recall ∆C ≈ (dC/dx) · ∆x
We want to change x such that C is reduced, i.e. ∆C should always be negative
What if we choose ∆x = −η (dC/dx)?
∆C ≈ (dC/dx) · ∆x
   = (dC/dx) · (−η dC/dx)
   = −η (dC/dx)²
   ≤ 0
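A tiny gradient-descent loop for a one-parameter cost, here C(x) = x² (an illustrative choice):

```python
# Gradient descent: repeatedly apply x <- x - eta * dC/dx.
def dC_dx(x):
    return 2 * x      # derivative of C(x) = x^2

eta = 0.1             # learning rate
x = 5.0
for step in range(50):
    x = x - eta * dC_dx(x)   # delta_x = -eta * dC/dx, so C can only decrease
print(x)  # close to 0, the minimum of C
```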
Gradient Descent
Back Propagation
w^k_{ij} ← w^k_{ij} − η ∂C/∂w^k_{ij}
δ^L = ∇_a C ⊙ σ′(z^L)
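A minimal backpropagation sketch for one training example, with sigmoid activations and squared-error loss; the layer sizes are illustrative, and the hidden-layer rule δ^l = ((w^{l+1})ᵀ δ^{l+1}) ⊙ σ′(z^l) is the standard equation, not shown on this slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Weights and biases: layer 1 maps 3 -> 4, layer 2 maps 4 -> 2.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([1.0, 0.0, 1.0])   # a single input
t = np.array([0.0, 1.0])        # its target output
eta = 0.5                        # learning rate

# Forward pass: z^l = w^l . a^{l-1} + b^l, a^l = sigma(z^l).
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Output-layer error: delta^L = grad_a C (*) sigma'(z^L); for squared error, grad_a C = a^L - t.
delta2 = (a2 - t) * sigmoid_prime(z2)
# Propagate the error backwards to the hidden layer.
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Gradient-descent updates: w <- w - eta * dC/dw, b <- b - eta * dC/db.
W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
```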
Deep Learning
[Figure: a deep neural network with several hidden layers]
SGD
SGD + Momentum
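A sketch of the momentum update v ← μv − η∇C, w ← w + v, applied to an illustrative quadratic cost:

```python
import numpy as np

# SGD with momentum on C(w) = ||w||^2 (an illustrative cost).
def grad(w):
    return 2 * w        # gradient of C

w = np.array([3.0, -2.0])
v = np.zeros_like(w)    # velocity
eta, mu = 0.1, 0.9      # learning rate and momentum coefficient

for step in range(100):
    v = mu * v - eta * grad(w)   # accumulate a decaying sum of past gradients
    w = w + v
print(w)  # approaches the minimum at the origin
```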
Other Optimizers
Conclusion
Thank You