
6.1 DeepFFNets M2

The document provides an overview of Deep Feedforward Networks (DFF), discussing their architecture, training processes, and importance in machine learning. It highlights the distinction between feedforward and recurrent networks, the role of hidden layers, and the concept of function approximation. Additionally, it addresses the challenges of learning non-linear functions and the significance of learning feature representations in deep learning models.

Uploaded by

yashbnv

Deep Learning

Deep Feedforward Networks: Overview
Topics in DFF Networks
1. Overview
2. Example: Learning XOR
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation Algorithms
6. Historical Notes

Sub-topics in Overview of DFF
1. Goal of a Feedforward Network
2. Feedforward vs Recurrent Networks
3. Function Approximation as Goal
4. Extending Linear Models (SVM)
5. Example of XOR
Goal of a feedforward network
• Feedforward nets are the quintessential deep learning models
• Deep feedforward networks are also called
– Feedforward neural networks or
– Multilayer perceptrons (MLPs)
• Their goal is to approximate some function f*
– E.g., a classifier y = f*(x) maps input x to category y
– A feedforward network defines a mapping y = f(x; θ) and learns the parameters θ that give the best function approximation
Feedforward network for MNIST
[Figure: feedforward network taking MNIST 28x28 images as input]
Source: https://towardsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f
Flow of Information
• Models y = f(x) are called feedforward because:
– To evaluate f(x), information flows one way from x, through the computations defining f, to the outputs y
• There are no feedback connections
– No outputs of the model are fed back into itself

Feedforward Net: US Presidential Election
• y = f(x)
• Output: y = {y1, y2}
– Electoral college votes for each candidate
• Input: X = {x1, .., x50}
– Vote vectors cast for the 2 candidates, one per state
• W converts votes to electoral college votes
• h is the electoral college votes, defined for each state as shown in the map
– E.g., winner takes all or proportionate
• Each state has a fixed number of electors
• w maps 50 states to 2 outputs
– Simple addition
Importance of Feedforward Networks
• They are extremely important to ML practice
• Form the basis for many commercial applications
1. CNNs are a special kind of feedforward network
• They are used for recognizing objects in photos
2. They are a conceptual stepping stone to RNNs
• RNNs power many NLP applications

Feedforward vs. Recurrent
• When feedforward neural networks are extended to include feedback connections they are called Recurrent Neural Networks (RNNs)
[Figure panels: RNN; unrolled RNN; RNN with learning component]
Feedforward Neural Network Structures
• They are called networks because they are composed of many different functions
• Model is associated with a directed acyclic graph describing how the functions are composed
– E.g., functions f^(1), f^(2), f^(3) connected in a chain to form f(x) = f^(3)[f^(2)[f^(1)(x)]]
• f^(1) is called the first layer of the network (which is a vector)
• f^(2) is called the second layer, etc.
• These chain structures are the most commonly used structures of neural networks
Definition of Depth
• Overall length of the chain is the depth of the model
– Ex: the composite function f(x) = f^(3)[f^(2)[f^(1)(x)]] has a depth of 3
• The name deep learning arises from this terminology
• Final layer of a feedforward network, e.g. f^(3), is called the output layer
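The chain structure and its depth can be illustrated with a minimal sketch. The three layer functions below are hypothetical placeholders chosen only for illustration; the slides do not specify them.

```python
import numpy as np

# Hypothetical layer functions (not from the slides), used only to show composition.
def f1(x):
    # first layer: an affine map followed by a ReLU
    return np.maximum(0.0, 2.0 * x - 1.0)

def f2(h):
    # second layer: an element-wise tanh
    return np.tanh(h)

def f3(h):
    # output layer: sum the hidden values into one scalar
    return h.sum()

def compose(*layers):
    """Return f(x) = layers[-1](... layers[1](layers[0](x)))."""
    def f(x):
        for layer in layers:
            x = layer(x)
        return x
    return f

layers = [f1, f2, f3]
f = compose(*layers)
depth = len(layers)          # length of the chain

x = np.array([0.0, 1.0, 2.0])
y = f(x)
print("depth:", depth, "output:", y)
```

Here the depth is simply the number of functions in the chain, and f3 plays the role of the output layer.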
Training the Network
• In network training we drive f(x) to match f*(x)
• Training data provides us with noisy, approximate examples of f*(x) evaluated at different training points
• Each example x is accompanied by a label y ≈ f*(x)
• Training examples specify directly what the output layer must do at each point x
– It must produce a value that is close to y


Definition of Hidden
Layer
• Behavior of other layers is not directly
specified by the data
• Learning algorithm must decide how to
use those layers to produce value that
is close to y
• Training data does not say what
individual layers should do
• Since the desired output for these
layers is not shown, they are called
hidden layers
Deep Learning
Srihari

A net with depth 2: one hidden layer
K outputs y1, .., yK for a given input x
Hidden layer consists of M units

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$

f(x) = f^(2)[f^(1)(x)]

f^(1) is a vector of M dimensions and f^(2) is a vector of K dimensions
f_m^(1) = z_m = h(x^T w^(1)), m = 1, .., M
f_k^(2) = σ(z^T w^(2)), k = 1, .., K
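A forward pass through such a depth-2 network might look like the sketch below. The sizes D, M, K, the random weights, and the choice of tanh for the hidden activation h are assumptions made for illustration; the slides leave h unspecified.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, h=np.tanh):
    """Depth-2 net: hidden layer of M units, then K sigmoid outputs.

    W1 has shape (M, D) and holds w_ji^(1); b1 holds the biases w_j0^(1).
    W2 has shape (K, M) and holds w_kj^(2); b2 holds the biases w_k0^(2).
    """
    z = h(W1 @ x + b1)            # f^(1): vector of M hidden activations
    return sigmoid(W2 @ z + b2)   # f^(2): vector of K outputs

# Assumed sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
D, M, K = 4, 3, 2
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
x = rng.normal(size=D)

y = forward(x, W1, b1, W2, b2)
print(y.shape, y)
```

The matrix products are just the double sum over j and i written in vectorized form.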
Feedforward net with depth 2
• Recognition of printed characters (OCR)
f(x) = f^(2)[f^(1)(x)]
– Hidden layer f^(1) compares raw pixel inputs to component patterns
Width of Model
• Each hidden layer is typically vector-valued
• Dimensionality of the hidden layer vector is the width of the model

Units of a model
• Each element of the vector is viewed as a neuron
– Instead of thinking of the layer as a vector-to-vector function, regard it as units acting in parallel
• Each unit receives inputs from many other units and computes its own activation value
Depth versus Width
• Going deeper makes the network more expressive
– It can capture variations of the data better
– Yields expressiveness more efficiently than width
• Tradeoff for more expressiveness is an increased tendency to overfit
– You will need more data or additional regularization
• Network should be as deep as the training data allows
– But you can only determine a suitable

Why are they neural networks?
• These networks are loosely inspired by neuroscience
• Each unit resembles a neuron
– Receives input from many other units
– Computes its own activation value
• Choice of functions f^(i)(x):
– Loosely guided by neuroscientific observations about biological neurons
• Modern neural networks are guided by many mathematical and engineering disciplines
• The goal is not to perfectly model the brain
Function Approximation is goal
• Think of feedforward networks as function approximation machines
– Designed to achieve statistical generalization
• Occasionally draw insights from what we know about the brain
– Rather than as models of brain function
Understanding Feedforward
Nets
• Begin with linear networks and
understand their limitations
• Linear models such as logistic
regression and linear regression can be
fit reliably and efficiently using either
– Closed-form solution
– Convex optimization
• Limitation

Extending Linear Models
• To represent non-linear functions of x
– apply linear model to transformed input ϕ(x)
• where ϕ is non-linear
– Equivalently, the kernel trick of SVM obtains nonlinearity

SVM Kernel trick
• Many ML algorithms can be rewritten as dot products between examples:
f(x) = w^T x + b written as b + Σ_i α_i x^T x^(i)
where x^(i) is a training example and α is a vector of coefficients
– This allows us to replace x with a feature function ϕ(x) and the dot product with a function k(x, x^(i)) = ϕ(x)·ϕ(x^(i)) called a kernel
• The · operator represents an inner product analogous to ϕ(x)^T ϕ(x^(i))
• For some feature spaces we may not literally use an inner product
– In continuous spaces, an inner product based on integration
– Gaussian kernel
• Consider k(u, v) = exp(−||u − v||² / 2σ²)
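The rewriting as dot products can be checked numerically with a small sketch. The training points, coefficients α, and bias below are random placeholders, and the identity w = Σ_i α_i x^(i) is the assumption under which the two forms coincide.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

# Placeholder data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))   # rows are training examples x^(i)
alpha = rng.normal(size=5)          # one coefficient per training example
b = 0.5
x = rng.normal(size=3)

# Linear form: w is a weighted sum of the training examples.
w = X_train.T @ alpha
primal = w @ x + b

# Same function written purely as dot products between examples.
dual = b + sum(a * (x @ xi) for a, xi in zip(alpha, X_train))

# Swapping the dot product for a kernel gives a function nonlinear in x.
f_kernel = b + sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X_train))
print(primal, dual, f_kernel)
```

The first two values agree, which is what lets the dot product be replaced by any valid kernel.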
SVM Prediction
• Use linear regression on the Lagrangian for determining the weights α_i
• We can make predictions using
– f(x) = b + Σ_i α_i k(x, x^(i))
– Function is nonlinear wrt x, but the relationship between ϕ(x) and f(x) is linear
– Also the relationship between α and f(x) is linear
– We can think of ϕ as providing a set of features
• describing x, or as providing a new representation for x
Disadvantages of Kernel Methods
• Cost of decision function evaluation: linear in m, the number of training examples
– Because the ith example contributes a term α_i k(x, x^(i)) to the decision function
– Can mitigate this by learning an α with mostly zeros
• Classification requires evaluating the kernel function only for training examples that have a nonzero α_i
• These are known as support vectors
• Cost of training: high with large data sets
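The saving from a sparse α can be sketched as follows. The data and the sparse coefficient vector are made up for illustration, and the helper name `predict_sparse` is hypothetical.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def predict_sparse(x, X_train, alpha, b, sigma=1.0):
    """Evaluate f(x) = b + sum_i alpha_i k(x, x^(i)) over support vectors only.

    Returns the prediction and the number of kernel evaluations used.
    """
    support = np.flatnonzero(alpha)   # indices with nonzero alpha_i
    total = b
    for i in support:
        total += alpha[i] * gaussian_kernel(x, X_train[i], sigma)
    return total, len(support)

# Made-up training set of m = 6 examples; only two have nonzero alpha.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(6, 2))
alpha = np.array([0.0, 1.5, 0.0, 0.0, -0.7, 0.0])
b = 0.1
x = rng.normal(size=2)

f_sparse, n_evals = predict_sparse(x, X_train, alpha, b)
f_full = b + sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X_train))
print(f_sparse, f_full, n_evals)
```

Both sums give the same value, but the sparse version touches only the support vectors rather than all m training examples.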
Options for choosing
mapping ϕ
1. Generic feature function ϕ (x)
– Radial basis function
2. Manually engineer ϕ
– Feature engineering
3. Principle of Deep Learning:
Learn ϕ

Option 1 to choose the mapping ϕ
• Generic feature function ϕ(x)
– Infinite-dimensional ϕ that is implicitly used by kernel machines based on RBF
• RBF: N(x; x^(i), σ²I) centered at x^(i)
• x^(i): from k-means clustering
• σ = mean distance between each unit j and its closest neighbor
– If ϕ(x) is of high enough dimension we can have enough capacity to fit the training set
• Generalization to test set remains poor
• Generic feature mappings are based on smoothness
– Do not include prior information to solve advanced problems
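A generic RBF feature map for Option 1 can be sketched as below. The centers are hard-coded stand-ins for k-means output, and the features use an unnormalized Gaussian bump rather than the normalized density N(x; x^(i), σ²I); both are simplifying assumptions.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2)), one feature per center."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical centers; in practice these might come from k-means clustering.
centers = np.array([[0.0, 0.0],
                    [1.0, 1.0],
                    [2.0, 0.0]])

# sigma = mean distance between each center and its closest neighbor.
dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)       # ignore distance of a center to itself
sigma = dists.min(axis=1).mean()

phi = rbf_features(np.array([1.0, 1.0]), centers, sigma)
print("sigma:", sigma, "phi:", phi)
```

A linear model applied to `phi` would then be nonlinear in the original input x.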

Option 2 to choose the mapping ϕ
• Manually engineer ϕ
• This was the dominant approach until the arrival of deep learning
• Requires decades of effort
– e.g., speech recognition, computer vision
• Little transfer between domains
Option 3 to choose the mapping ϕ
• Strategy of deep learning: learn ϕ
• Model is y = f(x; θ, w) = ϕ(x; θ)^T w
– θ used to learn ϕ from a broad class of functions
– Parameters w map from ϕ(x) to output
– Defines an FFN where ϕ defines a hidden layer
• Unlike the other two (basis functions, manual engineering), this approach gives up on convexity of training
– But its benefits outweigh the harms
Extend Linear Methods to Learn ϕ
K outputs y1, .., yK for a given input x
Hidden layer consists of M units

$$y_k(\mathbf{x}; \theta, \mathbf{w}) = \sum_{j=1}^{M} w_{kj}\, \phi_j\left( \sum_{i=1}^{D} \theta_{ji} x_i + \theta_{j0} \right) + w_{k0}$$

y_k = f_k(x; θ, w) = ϕ(x; θ)^T w

[Figure: network with basis functions ϕ0, ϕ1, .., ϕM, first-layer weights up to θ_MD and second-layer weights w10, .., w_KM]

Can be viewed as a generalization of linear models
• Nonlinear function f_k with M+1 parameters w_k = (w_k0, .., w_kM), with
• M basis functions ϕ_j, j = 1, .., M, each with D parameters θ_j = (θ_j1, .., θ_jD)
• Both w_k and θ_j are learnt from data
Approaches to Learning ϕ
• Parameterize the basis functions as ϕ(x; θ)
– Use optimization to find θ that corresponds to a good representation
• Approach can capture the benefit of the first approach (fixed basis functions) by being highly generic
– By using a broad family for ϕ(x; θ)
• Can also capture the benefits of the second approach
– Human practitioners design families of ϕ(x; θ) that will perform well
Importance of Learning ϕ
• Learning ϕ is discussed beyond this first introduction to feedforward networks
– It is a recurring theme throughout deep learning, applicable to all kinds of models
• Feedforward networks are an application of this principle to learning deterministic mappings from x to y without feedback
• Also applicable to
– learning stochastic mappings
Plan of Discussion: Feedforward Networks
1. A simple example: learning XOR
2. Design decisions for a feedforward network
– Many are the same as for designing a linear model
• Basics of gradient descent
– Choosing the optimizer, cost function, form of output units
– Some are unique
• Concept of hidden layer
– Makes it necessary to have activation functions
• Architecture of network
– How many layers, how they are connected to each other, how many units in each layer
– Backpropagation and modern generalizations

1. Ex: XOR problem
• XOR: an operation on binary variables x1 and x2
– When exactly one value equals 1 it returns 1, otherwise it returns 0
– Target function is y = f*(x) that we want to learn
• Our model is y = f([x1, x2]; θ), which we learn, i.e., adapt parameters θ to make it similar to f*
• Not concerned with statistical generalization
– Perform correctly on four training points:
• X = {[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T}
• f([0,1]^T; θ) = f([1,0]^T; θ) = 1
– Challenge is to fit the training set
ML for XOR: linear model doesn't fit
• Treat it as regression with MSE loss function

$$J(\theta) = \frac{1}{4} \sum_{\mathbf{x} \in X} \left( f^*(\mathbf{x}) - f(\mathbf{x}; \theta) \right)^2 = \frac{1}{4} \sum_{n=1}^{4} \left( f^*(\mathbf{x}_n) - f(\mathbf{x}_n; \theta) \right)^2$$

– Usually not used for binary data
– But math is simple
• Alternative is cross-entropy J(θ)

$$J(\theta) = -\ln p(\mathbf{t} \mid \theta) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = \sigma(\theta^T \mathbf{x}_n)$$

• We must choose the form of the model
• Consider a linear model with θ = {w, b}, where f(x; w, b) = x^T w + b
– Minimize

$$J(\theta) = \frac{1}{4} \sum_{n=1}^{4} \left( t_n - \mathbf{x}_n^T \mathbf{w} - b \right)^2$$

to get a closed-form solution
• Differentiate wrt w and b to obtain w = 0 and b = ½
– Then the linear model f(x; w, b) = ½ simply outputs 0.5 everywhere
– Why does this happen?
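The closed-form result w = 0, b = ½ can be reproduced with an ordinary least-squares solve. This is a minimal sketch using NumPy's `np.linalg.lstsq`; appending a column of ones to carry the bias is an implementation choice, not something the slides prescribe.

```python
import numpy as np

# The four XOR training points and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is fitted together with w.
A = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of A @ [w1, w2, b] ~= t, i.e. the MSE minimizer.
theta, *_ = np.linalg.lstsq(A, t, rcond=None)
w, b = theta[:2], theta[2]

pred = X @ w + b
mse = np.mean((t - pred) ** 2)
print("w =", w, "b =", b, "predictions =", pred, "MSE =", mse)
```

The fitted model predicts 0.5 for every input, so the error is the same on all four points.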
Linear model cannot solve XOR
• Bold numbers are the values the system must output
• When x1 = 0, output has to increase with x2
• When x1 = 1, output has to decrease with x2
• Linear model f(x; w, b) = x1 w1 + x2 w2 + b has to assign a single weight to x2, so it cannot solve this problem
• A better solution:
– use a model to learn a different representation
• in which a linear model is able to represent the solution
– We use a simple feedforward network

Feedforward Network for XOR
• Introduce a simple feedforward network
– with one hidden layer containing two units
• Same network drawn in two different styles
– Matrix W describes mapping from x to h
– Vector w describes mapping from h to y
– Intercept parameters b are omitted


Functions computed by Network
• Layer 1 (hidden layer): vector of hidden units h computed by function f^(1)(x; W, c)
– c are bias variables
• Layer 2 (output layer) computes f^(2)(h; w, b)
– w are linear regression weights
– Output is linear regression applied to h rather than to x
• Complete model is f(x; W, c, w, b) = f^(2)(f^(1)(x))
Linear vs Nonlinear functions
• If we choose both f^(1) and f^(2) to be linear, the total function will still be linear: f(x) = x^T w'
– Suppose f^(1)(x) = W^T x and f^(2)(h) = h^T w
– Then we could represent this function as f(x) = x^T w' where w' = Ww
• Since linear is insufficient, we must use a nonlinear function to describe the features
– We use the strategy of neural networks
– by using a nonlinear activation function h = g(W^T x + c)
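The collapse of two linear layers into one can be confirmed numerically. The shapes and random values below are arbitrary assumptions used only to exercise the identity w' = Ww.

```python
import numpy as np

# Arbitrary sizes: x in R^2, hidden h in R^3.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # maps x to h via h = W^T x
w = rng.normal(size=3)        # maps h to a scalar via h^T w
x = rng.normal(size=2)

# Two linear layers applied in sequence.
h = W.T @ x
two_layer = h @ w

# The equivalent single linear model with w' = W w.
w_prime = W @ w
one_layer = x @ w_prime

print(two_layer, one_layer)
```

Because the two values match for any x, stacking linear layers adds no representational power.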
Activation Function
• In linear regression we used a vector of weights w and a scalar bias b: f(x; w, b) = x^T w + b
– to describe an affine transformation from an input vector to an output scalar
• Now we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed
• Activation function g is typically chosen to be applied element-wise: h_i = g(x^T W_{:,i} + c_i)
Default Activation Function
• Activation: g(z) = max{0, z}
– Applying this to the output of a linear transformation yields a nonlinear transformation
– However, the function remains close to linear
• Piecewise linear with two pieces
• Therefore preserves properties that make linear models easy to optimize with gradient-based methods
• Preserves many properties that make linear models generalize well

A principle of CS: build complicated systems from minimal components.
– A Turing machine's memory needs only 0 and 1 states.
– We can build a universal function approximator from ReLUs.
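The element-wise use of g(z) = max{0, z} on an affine transformation can be sketched as follows. The weight matrix, bias vector, and input are made-up values for illustration.

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z}, applied element-wise."""
    return np.maximum(0.0, z)

# Made-up parameters: D = 2 inputs, M = 3 hidden units.
W = np.array([[1.0, -1.0, 0.5],
              [2.0,  0.0, -1.0]])
c = np.array([0.0, 1.0, -0.5])
x = np.array([1.0, -1.0])

# Vectorized form: h = g(W^T x + c).
h = relu(W.T @ x + c)

# Unit-by-unit form: h_i = g(x^T W[:, i] + c_i).
h_units = np.array([relu(x @ W[:, i] + c[i]) for i in range(W.shape[1])])

print(h, h_units)
```

Each hidden unit sees its own column of W and its own bias, and the two pieces of the ReLU show up as either a zero or the unchanged pre-activation.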
Specifying the Network using ReLU
• Activation: g(z) = max{0, z}
• We can now specify the complete network as
f(x; W, c, w, b) = f^(2)(f^(1)(x)) = w^T max{0, W^T x + c} + b
We can now specify XOR Solution
• Let

$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$$

in f(x; W, c, w, b) = w^T max{0, W^T x + c} + b
• Now walk through how the model processes a batch of inputs
• Design matrix X of all four points:

$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$

• First step is XW:

$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$

• Adding c:

$$XW + c = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

In this space all points lie along a line with slope 1. Cannot be implemented by a linear model.
• Compute h using ReLU:

$$\max\{0, XW + c\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

Has changed the relationship among examples. They no longer lie on a single line. A linear model suffices.
• Finish by multiplying by w:

$$f(x) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$

• Network has obtained the correct answer for all 4 examples
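The walk-through can be replayed step by step in NumPy. This sketch uses the parameter values W = [[1,1],[1,1]], c = [0,−1], w = [1,−2], b = 0 and relies on broadcasting to add c to every row of XW.

```python
import numpy as np

# Parameters of the hand-specified XOR solution.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

# Design matrix: one XOR input per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

XW = X @ W                    # first step
pre = XW + c                  # add the bias vector to every row
H = np.maximum(0.0, pre)      # ReLU gives the hidden representation
y = H @ w + b                 # output layer: linear model on h

print("XW =\n", XW)
print("XW + c =\n", pre)
print("h =\n", H)
print("f(x) =", y)
```

The final vector matches the XOR targets for the four inputs in the order listed in X.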
Learned representation for XOR
• Two points that must have output 1 have been collapsed into one
• Points x = [0,1]^T and x = [1,0]^T have been mapped into h = [1,0]^T
• Can now be described by a linear model
– For fixed h2, output increases in h1
[Figure, original x space: when x1 = 0, output has to increase with x2; when x1 = 1, output has to decrease with x2]
[Figure, learned h space: when h1 = 0, output is constant 0 with h2; when h1 = 1, output is constant 1 with h2; when h1 = 2, output is constant 0 with h2]

About the XOR example
• We simply specified the solution
– Then showed that it achieves zero error
• In real situations there might be billions of parameters and billions of training examples
– So one cannot simply guess the solution
• Instead gradient descent optimization can find parameters that produce very little error
– The solution described is at the global minimum
• Gradient descent could converge to this solution
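That the specified solution sits at a global minimum can be checked with a short sketch: the MSE there is zero, its gradient vanishes, and a gradient descent step leaves the parameters unchanged. The hand-written gradient code and the learning rate 0.1 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0], dtype=float)

def mse_and_grads(params, X, t):
    """MSE of f(x) = w^T max{0, W^T x + c} + b and its gradients."""
    W, c, w, b = params
    pre = X @ W + c
    H = np.maximum(0.0, pre)
    r = (H @ w + b) - t                 # residuals
    loss = np.mean(r ** 2)
    dy = 2.0 * r / len(t)               # dLoss/dOutput
    grad_w = H.T @ dy
    grad_b = dy.sum()
    dpre = np.outer(dy, w) * (pre > 0)  # backprop through the ReLU
    grad_W = X.T @ dpre
    grad_c = dpre.sum(axis=0)
    return loss, (grad_W, grad_c, grad_w, grad_b)

# The hand-specified solution.
solution = (np.ones((2, 2)), np.array([0.0, -1.0]), np.array([1.0, -2.0]), 0.0)
loss, grads = mse_and_grads(solution, X, t)

# One gradient descent step with an assumed learning rate.
lr = 0.1
stepped = tuple(p - lr * g for p, g in zip(solution, grads))

# A perturbed parameter setting has strictly larger loss.
perturbed = (solution[0] + 0.1,) + solution[1:]
perturbed_loss, _ = mse_and_grads(perturbed, X, t)
print(loss, perturbed_loss)
```

Since the MSE cannot go below zero, a zero loss with zero gradient is a fixed point of gradient descent at the global minimum.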
Learning XOR
XOR can't be calculated by a single perceptron