Deep Learning
Deep Feedforward Networks: Overview
Topics in DFF Networks
1. Overview
2. Example: Learning XOR
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation Algorithms
6. Historical Notes
Sub-topics in Overview of DFF
1. Goal of a Feed-Forward Network
2. Feedforward vs Recurrent Networks
3. Function Approximation as Goal
4. Extending Linear Models (SVM)
5. Example of XOR
Goal of a feedforward network
• Feedforward nets are the quintessential deep learning models
• Deep feedforward networks are also called
– Feedforward neural networks or
– Multilayer Perceptrons (MLPs)
• Their goal is to approximate some function f *
– E.g., a classifier y = f *(x) maps input x to category y
– A feedforward network defines a mapping y = f (x; θ) and learns the parameters θ that give the best function approximation
Feedforward network for MNIST
[Figure: feedforward network taking MNIST 28x28 images as input]
Source: https://towardsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f
Flow of Information
• Models are called feedforward because:
– To evaluate f (x), information flows one way from x, through the intermediate computations defining f, to the outputs y
• There are no feedback connections
– No outputs of the model are fed back into itself
Feedforward Net: US Presidential Election
• US Presidential Election as y=f (x)
• Output: y={y1, y2}
– Electoral college votes for each candidate
• Input: X={x1,..x50}
– Vote vectors cast for the 2 candidates, one per state
• W converts votes to electoral college votes h
– h is defined for each state as shown in the map
– Each state has a fixed number of electors
– E.g., winner takes all, or proportionate
• w maps the 50 states to 2 outputs
– Simple addition
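The two-stage structure of this example can be sketched in Python. The three states, their vote counts and their elector numbers below are invented purely for illustration, and winner-takes-all is assumed for every state.

```python
# Hypothetical data: (votes for candidate 1, votes for candidate 2, electors).
# Three made-up states stand in for the 50 real ones.
states = [
    (600, 400, 10),
    (300, 700, 20),
    (550, 450, 5),
]

def hidden_layer(states):
    """Winner-takes-all: each state gives all its electors to its vote leader."""
    h = []
    for votes_1, votes_2, electors in states:
        if votes_1 > votes_2:
            h.append((electors, 0))
        else:
            h.append((0, electors))
    return h

def output_layer(h):
    """Simple addition of the per-state electoral votes for each candidate."""
    y1 = sum(e1 for e1, _ in h)
    y2 = sum(e2 for _, e2 in h)
    return y1, y2

y = output_layer(hidden_layer(states))
print(y)  # (15, 20)
```

The hidden layer is the only nonlinear step here; the output layer just sums.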
Importance of Feedforward
Networks
• They are extremely important to ML
practice
• Form basis for many commercial
applications
1. CNNs are a special kind of feedforward network
• They are used for recognizing objects in photos
2. They are a conceptual stepping stone to RNNs
• RNNs power many NLP applications
Feedforward vs. Recurrent
• When feedforward neural networks are extended to include feedback connections, they are called Recurrent Neural Networks (RNNs)
[Figure panels: RNN; Unrolled RNN; RNN with learning component]
Feedforward Neural Network Structures
• They are called networks because they are composed of many different functions
• Model is associated with a directed acyclic graph describing how the functions are composed
– E.g., functions f (1), f (2), f (3) connected in a chain to form f (x)= f (3) [ f (2) [ f (1)(x)]]
• f (1) is called the first layer of the network (its output is a vector)
• f (2) is called the second layer, etc.
• These chain structures are the most commonly used structures of neural networks
Definition of Depth
• Overall length of the chain is the depth
of the model
– Ex: the composite function f (x)= f (3) [ f (2) [ f (1)(x)]] has a depth of 3
• The name deep learning arises from this terminology
• Final layer of a feedforward network, e.g., f (3), is called the output layer
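A minimal Python sketch of the chain structure and its depth. The three layer functions are arbitrary placeholders chosen for illustration, not layers from the lecture.

```python
from functools import reduce

def compose(*layers):
    """Return f(x) = layers[-1](...layers[1](layers[0](x))...)."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Three illustrative layers (hypothetical; any functions would do).
f1 = lambda x: [2 * v for v in x]   # first layer
f2 = lambda h: [v + 1 for v in h]   # second layer
f3 = lambda h: sum(h)               # output layer

layers = (f1, f2, f3)
f = compose(*layers)
depth = len(layers)                 # length of the chain

print(f([1, 2]))  # f3(f2(f1([1, 2]))) = (2 + 1) + (4 + 1) = 8
print(depth)      # 3
```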
Training the Network
• In network training we drive f (x) to
match f* (x)
• Training data provides us with noisy,
approximate examples of f* (x)
evaluated at different training points
• Each example accompanied by label y
≈ f*(x)
• Training examples specify directly what the output layer must do at each point x
– It must produce a value that is close to y
Definition of Hidden
Layer
• Behavior of other layers is not directly
specified by the data
• Learning algorithm must decide how to
use those layers to produce value that
is close to y
• Training data does not say what
individual layers should do
• Since the desired output for these
layers is not shown, they are called
hidden layers
Deep Learning
Srihari
A net with depth 2: one hidden layer
K outputs y1,..yK for a given input x
Hidden layer consists of M units
$$y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$
f (x)= f (2) [ f (1)(x)]
f (1) is a vector of M dimensions and f (2) is a vector of K dimensions
fm (1) =zm= h(xTw(1)), m=1,..M
fk (2) = σ (zTw(2)), k=1,..K
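A minimal sketch of this forward pass in plain Python. It assumes h is tanh and σ is the logistic sigmoid; the sizes (D = 2, M = 3, K = 2) and all weight values are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2, h=math.tanh):
    """One-hidden-layer network: z = h(W1 x + b1), y = sigmoid(W2 z + b2)."""
    # Hidden layer f(1): M units, each a weighted sum of the D inputs plus a bias.
    z = [h(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    # Output layer f(2): K units, each a weighted sum of the M hidden units plus a bias.
    y = [sigmoid(sum(w * zj for w, zj in zip(row, z)) + b) for row, b in zip(W2, b2)]
    return z, y

# Hypothetical weights: D = 2 inputs, M = 3 hidden units, K = 2 outputs.
W1 = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]   # M x D
b1 = [0.0, -1.0, 0.5]                         # M biases (w_j0)
W2 = [[1.0, -1.0, 0.5], [0.0, 0.0, 0.0]]      # K x M
b2 = [0.0, 0.0]                               # K biases (w_k0)

z, y = forward([1.0, 1.0], W1, b1, W2, b2)
print(len(z), len(y))  # 3 2
```

The number of rows in W1 is the width M of the hidden layer; the number of rows in W2 is the number of outputs K.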
Feedforward net with depth 2
• Recognition of printed characters (OCR)
f (x)= f (2) [ f (1)(x)]
– Hidden layer f (1) compares raw pixel inputs to component patterns
Width of Model
• Each hidden layer is typically vector-valued
• Dimensionality of the hidden layer vector is the width of the model
Units of a model
• Each element of the vector is viewed as a neuron
– Instead of thinking of the layer as a single vector-to-vector function, its elements are regarded as units acting in parallel
• Each unit receives inputs from many other units and computes its own activation value
Depth versus Width
• Going deeper makes network more
expressive
– It can capture variations of the data better.
– Yields expressiveness more efficiently than
width
• Tradeoff for more expressiveness is
increased tendency to overfit
– You will need more data or additional
regularization
• Network should be as deep as the training data allows.
– But you can only determine a suitable
Why are they neural networks?
• These networks are loosely inspired by neuroscience
• Each unit resembles a neuron
– Receives input from many other units
– Computes its own activation value
• Choice of functions f (i)(x):
– Loosely guided by neuroscientific observations about biological neurons
• Modern neural networks are guided by many mathematical and engineering disciplines
• They do not aim to perfectly model the brain
Function Approximation is goal
• Think of feedforward networks as function approximation machines
– Designed to achieve statistical generalization
• Occasionally draw insights from what we know about the brain
– Rather than as models of brain function
Understanding Feedforward
Nets
• Begin with linear networks and
understand their limitations
• Linear models such as logistic
regression and linear regression can be
fit reliably and efficiently using either
– Closed-form solution
– Convex optimization
• Limitation: model capacity is limited to linear functions
Extending Linear Models
• To represent non-linear functions of x
– apply linear model to transformed input ϕ(x)
• where ϕ is non-linear
– Equivalently kernel trick of SVM obtains
nonlinearity
SVM Kernel trick
• Many ML algorithms can be rewritten as dot products between examples:
f (x)=wTx+b written as b + Σi αi xTx(i)
where x(i) is a training example and α is a vector of coefficients
– This allows us to replace x with a feature function ϕ(x) and the dot product with a function k(x,x(i))=ϕ(x)·ϕ(x(i)) called a kernel
• The · operator represents an inner product analogous to ϕ(x)Tϕ(x(i))
• For some feature spaces we may not literally use a vector inner product
– In continuous spaces, an inner product based on integration
– Gaussian kernel
• Consider k(u,v) = exp(−||u−v||²/2σ²)
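The Gaussian kernel can be written directly from its formula. This is a small sketch in plain Python; the function name and the default σ = 1 are choices made here for illustration.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))  # 1.0 for identical points
print(gaussian_kernel([0.0, 0.0], [3.0, 4.0]))  # exp(-12.5), close to 0 for distant points
```

The kernel acts as a similarity score: it is 1 when the two points coincide and decays toward 0 as they move apart, at a rate set by σ.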
SVM Prediction
• Use linear regression on Lagrangian for determining the weights αi
• We can make predictions using
– f (x)= b + Σiαi k(x,x(i))
– Function is nonlinear wrt x but relationship between ϕ(x) and f (x) is linear
– Also the relationship between α and f (x) is linear
– We can think of ϕ as providing a set of features describing x, or as providing a new representation for x
Disadvantages of Kernel
Methods
• Cost of decision function evaluation: linear in m, the number of training examples
– Because the ith example contributes term αi k(x, x(i)) to the decision function
– Can mitigate this by learning an α with
mostly zeros
• Classification requires evaluating the kernel
function only for training examples that have
a nonzero αi
• These are known as support vectors
• Cost of training: high with large data sets
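A sketch of why sparse α helps at prediction time, assuming a Gaussian kernel. The training points, the coefficients α and the intercept b are hypothetical values, not fitted to any data.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def predict(x, train_x, alpha, b, kernel=gaussian_kernel):
    """f(x) = b + sum_i alpha_i k(x, x_i), skipping examples with alpha_i = 0."""
    total = b
    evaluations = 0
    for x_i, a_i in zip(train_x, alpha):
        if a_i == 0.0:
            continue  # not a support vector: contributes nothing
        total += a_i * kernel(x, x_i)
        evaluations += 1
    return total, evaluations

# Hypothetical training points and coefficients (not fitted to any data).
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
alpha = [0.0, 1.5, 0.0, -0.5]
b = 0.1

value, evaluations = predict([0.0, 1.0], train_x, alpha, b)
print(evaluations)  # 2 kernel evaluations instead of 4
```

Without the skip, every prediction costs one kernel evaluation per training example, which is the linear-in-m cost noted above.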
Options for choosing
mapping ϕ
1. Generic feature function ϕ (x)
– Radial basis function
2. Manually engineer ϕ
– Feature engineering
3. Principle of Deep Learning:
Learn ϕ
Option 1 to choose the mapping ϕ
• Generic feature function ϕ (x)
– Infinite-dimensional ϕ that is implicitly used by kernel machines based on RBF
• RBF: N(x ; x(i), σ2I) centered at x(i)
[Figure annotations: x(i): from k-means clustering; σ = mean distance between each unit j and its closest neighbor]
– If ϕ(x) is of high enough dimension we can have enough capacity to fit the training set
• Generalization to test set remains poor
• Generic feature mappings are based on smoothness
– Do not include prior information to solve advanced problems
Option 2 to choose the mapping ϕ
• Manually engineer ϕ
• This was the dominant approach until arrival of deep learning
• Requires decades of effort
– e.g., speech recognition, computer vision
• Little transfer between domains
Option 3 to choose the
mapping ϕ
• Strategy of Deep learning: Learn ϕ
• Model is y=f (x; θ,w) = ϕ(x; θ)T w
– θ used to learn ϕ from broad class of
functions
– Parameters w map from ϕ (x) to output
– Defines FFN where ϕ defines a hidden layer
• Unlike the other two (basis functions, manual engineering), this approach gives up on convexity of training
– But its benefits outweigh the harms
Extend Linear Methods to Learn ϕ
K outputs y1,..yK for a given input x
Hidden layer consists of M units
$$y_k(x; \theta, w) = \sum_{j=1}^{M} w_{kj}\, \phi_j\left( \sum_{i=1}^{D} \theta_{ji} x_i + \theta_{j0} \right) + w_{k0}$$
yk = fk (x;θ,w) = ϕ (x;θ)T w
[Figure: network with basis functions ϕ0, ϕ1,..ϕM, hidden-layer parameters up to θMD and output weights w10,..wKM]
Can be viewed as a generalization of linear models
• Nonlinear function fk with M+1 parameters wk= (wk0 ,..wkM ) with
• M basis functions, ϕj j=1,..M each with D parameters θj= (θj1,..θjD)
• Both wk and θj are learnt from data
Approaches to Learning ϕ
• Parameterize the basis functions as ϕ(x;θ)
– Use optimization to find θ that corresponds to a good representation
• Approach can capture benefit of first approach (fixed basis functions) by being highly generic
– By using a broad family for ϕ(x;θ)
• Can also capture benefits of second approach
– Human practitioners design families of ϕ(x;θ) that will perform well
Importance of Learning ϕ
• Learning ϕ is discussed beyond this first introduction to feedforward networks
– It is a recurring theme throughout deep learning applicable to all kinds of models
• Feedforward networks are an application of this principle to learning deterministic mappings from x to y without feedback
• Also applicable to
– learning stochastic mappings
Plan of Discussion: Feedforward
Networks
1. A simple example: learning XOR
2. Design decisions for a feedforward
network
– Many are same as for designing a linear
model
• Basics of gradient descent
– Choosing the optimizer, Cost function, Form of output
units
– Some are unique
• Concept of hidden layer
– Makes it necessary to have activation functions
• Architecture of network
– How many layers, how they are connected to each other, how many units in each layer
– Backpropagation and modern generalizations
1. Ex: XOR problem
• XOR: an operation on binary variables x1 and x2
– When exactly one value equals 1 it returns 1, otherwise it returns 0
– Target function is y=f *(x) that we want to learn
• Our model is y =f ([x1, x2] ; θ) which we learn, i.e., adapt parameters θ to make it similar to f *
• Not concerned with statistical generalization
– Perform correctly on four training points:
• X={[0,0]T, [0,1]T, [1,0]T, [1,1]T}
• f ([0,1]T; θ) = f ([1,0]T; θ) = 1
– Challenge is to fit the training set
ML for XOR: linear model doesn’t fit
• Treat it as regression with MSE loss function
$$J(\theta) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x;\theta) \right)^2 = \frac{1}{4} \sum_{n=1}^{4} \left( f^*(x_n) - f(x_n;\theta) \right)^2$$
– Usually not used for binary data
– But math is simple
• Alternative is the cross-entropy J(θ)
$$J(\theta) = -\ln p(t \mid \theta) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = \sigma(\theta^T x_n)$$
• We must choose the form of the model
• Consider a linear model with θ ={w,b} where
$$f(x; w, b) = x^T w + b$$
– Minimize
$$J(\theta) = \frac{1}{4} \sum_{n=1}^{4} \left( t_n - x_n^T w - b \right)^2$$
to get a closed-form solution
• Differentiate wrt w and b to obtain w = 0 and b=½
– Then the linear model f(x;w,b)=½ simply outputs 0.5 everywhere
– Why does this happen?
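The closed-form result can be checked numerically. This plain-Python sketch evaluates the MSE and its gradient at w = 0, b = ½ on the four XOR points. The function names are choices made here; because J is a convex quadratic in (w, b), a zero gradient marks the global minimum.

```python
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
t = [0, 1, 1, 0]  # XOR targets

def mse(w, b):
    """J = (1/4) * sum_n (t_n - x_n^T w - b)^2."""
    return sum((tn - (w[0] * x1 + w[1] * x2 + b)) ** 2
               for (x1, x2), tn in zip(X, t)) / 4

def gradient(w, b):
    """Partial derivatives of J with respect to w1, w2 and b."""
    g_w1 = g_w2 = g_b = 0.0
    for (x1, x2), tn in zip(X, t):
        residual = tn - (w[0] * x1 + w[1] * x2 + b)
        g_w1 += -2 * residual * x1 / 4
        g_w2 += -2 * residual * x2 / 4
        g_b += -2 * residual / 4
    return g_w1, g_w2, g_b

w, b = (0.0, 0.0), 0.5
print(gradient(w, b))  # (0.0, 0.0, 0.0)
print(mse(w, b))       # 0.25
```

The best linear fit therefore still leaves an error of 0.25: every point is predicted as 0.5.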
Linear model cannot solve XOR
• Bold numbers (in the figure) are the values the system must output
• When x1=0, output has to increase with x2
• When x1=1, output has to decrease with x2
• Linear model f (x;w,b)= x1w1+x2w2+b has to assign a single weight to x2, so it cannot solve this problem
• A better solution:
– use a model to learn a different representation
• in which a linear model is able to represent the solution
– We use a simple feedforward network
Feedforward Network for XOR
• Introduce a simple feedforward network
– with one hidden layer containing two units
• Same network drawn in two different styles
– Matrix W describes mapping from x to h
– Vector w describes mapping from h to y
– Intercept parameters b are omitted
Functions computed by Network
• Layer 1 (hidden layer): vector of hidden units h computed by function f (1)(x; W,c)
– c are bias variables
• Layer 2 (output layer) computes f (2)(h; w,b)
– w are linear regression weights
– Output is linear regression applied to h rather than to x
• Complete model is f (x; W,c,w,b)=f (2)(f (1)(x))
Linear vs Nonlinear functions
• If we choose both f (1) and f (2) to be linear, the total function will still be linear f (x)=xTw’
– Suppose f (1)(x)= WTx and f (2)(h)=hTw
– Then we could represent this function as f (x)=xTw’ where w’=Ww
• Since linear is insufficient, we must use a nonlinear function to describe the features
– We use the strategy of neural networks
– by using a nonlinear activation function h=g(WTx+c)
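A quick numerical check that two linear layers collapse into one, using f (x) = (WTx)Tw = xT(Ww). The weight values and test inputs are arbitrary numbers picked for illustration.

```python
def matvec_T(W, x):
    """Compute W^T x for W stored as a list of rows."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical weights chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0]]   # maps x to h = W^T x
w = [0.5, -1.0]                # maps h to y = h^T w

# Collapsed weight vector w' = W w.
w_prime = [dot(row, w) for row in W]

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, -3.0]):
    two_layer = dot(matvec_T(W, x), w)   # f(2)(f(1)(x))
    one_layer = dot(x, w_prime)          # x^T w'
    assert two_layer == one_layer
print(w_prime)  # [-1.5, -2.5]
```

Stacking linear layers adds parameters but no representational power, which is why a nonlinear g is needed.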
Activation Function
• In linear regression we used a vector of weights w and scalar bias b
f (x;w,b) = xTw + b
– to describe an affine transformation from an input vector to an output scalar
• Now we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed
• Activation function g is typically chosen to be applied element-wise
hi=g(xTW:,i+ci)
Default Activation Function
• Activation: g(z)=max{0,z}
– Applying this to the output of a linear transformation yields a nonlinear transformation
– However, the function remains close to linear
• Piecewise linear with two pieces
• Therefore preserves properties that make linear models easy to optimize with gradient-based methods
• Preserves many properties that make linear models generalize well
A principle of CS: build complicated systems from minimal components.
– A Turing machine's memory needs only 0 and 1 states
– We can build a universal function approximator from ReLUs
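A short sketch of the ReLU and of how two ReLUs combine into another piecewise-linear function. The absolute-value example is an illustration chosen here, not one taken from the slides.

```python
def relu(z):
    """g(z) = max{0, z}: slope 0 for z < 0, slope 1 for z > 0."""
    return max(0.0, z)

# Two linear pieces, joined at z = 0.
print([relu(z) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)])  # [0.0, 0.0, 0.0, 1.0, 2.0]

# Combining ReLUs builds other piecewise-linear functions, e.g. |z| = g(z) + g(-z).
def abs_from_relu(z):
    return relu(z) + relu(-z)

print(abs_from_relu(-3.0), abs_from_relu(3.0))  # 3.0 3.0
```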
Specifying the Network using ReLU
• Activation: g(z)=max{0,z}
• We can now specify the complete network as
f (x; W,c,w,b)=f (2)(f (1)(x))=wT max {0,WTx+c}+b
We can now specify XOR Solution
f (x; W,c,w,b)= wT max {0,WTx+c}+b
• Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$$
• Now walk through how model processes a batch of inputs
• Design matrix X of all four points:
$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$
• First step is XW:
$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
• Adding c (to every row):
$$XW + c = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
– In this space all points lie along a line with slope 1. The XOR outputs cannot be implemented by a linear model
• Compute h using ReLU:
$$\max\{0, XW + c\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
– Has changed relationship among examples. They no longer lie on a single line. A linear model suffices
• Finish by multiplying by w:
$$f(x) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$
• Network has obtained correct answer for all 4 examples
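The walk-through can be reproduced in a few lines of plain Python. The values of W, c, w and b are those of this XOR solution; the function names are choices made here.

```python
def relu(z):
    return max(0, z)

# Parameters of the XOR solution.
W = [[1, 1], [1, 1]]
c = [0, -1]
w = [1, -2]
b = 0

def f(x):
    """f(x; W, c, w, b) = w^T max{0, W^T x + c} + b."""
    # Hidden units: h_i = g(x^T W[:, i] + c_i)
    h = [relu(sum(x[k] * W[k][i] for k in range(2)) + c[i]) for i in range(2)]
    # Output: linear regression applied to h
    return sum(wi * hi for wi, hi in zip(w, h)) + b

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [f(x) for x in X]
print(outputs)  # [0, 1, 1, 0]
```

The outputs match the XOR targets exactly, so the MSE on the four training points is zero.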
Learned representation for XOR
• Two points that must have output 1 have been collapsed into one
• Points x=[0,1]T and x=[1,0]T have been mapped into h=[1,0]T
• Can now be described by a linear model
– For fixed h2, output increases in h1
[Figure annotations, original x space: When x1=0, output has to increase with x2. When x1=1, output has to decrease with x2]
[Figure annotations, learned h space: When h1=0, output is constant 0 with h2. When h1=1, output is constant 1 with h2. When h1=2, output is constant 0 with h2]
About the XOR example
• We simply specified the solution
– Then showed that it achieves zero error
• In real situations there might be billions of parameters and billions of training examples
– So one cannot simply guess the solution
• Instead gradient descent optimization can find parameters that produce very little error
– The solution described is at the global minimum
• Gradient descent could converge to this solution
Learning XOR
XOR can't be calculated by a single perceptron