Greedy Layerwise Learning

Deep neural networks can be difficult to train directly due to non-convex optimization problems and vanishing gradients. One approach is to pretrain each layer of the deep network in a greedy, unsupervised manner before fine-tuning the entire network jointly using supervised learning. This greedy pretraining trains each layer sequentially to learn representations from its input, then fixes its parameters before training the next layer. After pretraining all layers, supervised fine-tuning refines the representations for the final task.

Neural Network and Deep Learning

Chapter 4
Deep Neural Network
Topics
Hyperparameters
Greedy Layer-wise Learning

Readings:
Nielsen (online book)
Neural Networks and Deep Learning

1
Outline
• Deep Neural Networks (DNNs) (Part I)
– Three ideas for training a DNN
– Experiments: MNIST digit classification
– Autoencoders
– Pretraining
• Convolutional Neural Networks (CNNs)
– Convolutional layers
– Pooling layers
– Image recognition
• Recurrent Neural Networks (RNNs) (Part II)
– Bidirectional RNNs
– Deep Bidirectional RNNs
– Deep Bidirectional LSTMs
– Connection to forward-backward algorithm

2
PRE-TRAINING FOR DEEP NETS

3
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Goals for Today's Lecture
1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training
4
Training Idea #1: No pre-training
• Idea #1: (Just like a shallow network)
– Compute the supervised gradient by backpropagation.
– Take small steps opposite the gradient (SGD).

5
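To make Idea #1 concrete, here is a minimal sketch, assuming PyTorch and random stand-in tensors in place of MNIST (both are assumptions; the slides do not prescribe a framework): a deep sigmoid network trained end-to-end with supervised SGD from a random initialization.

```python
# Idea #1 sketch: train every layer jointly with supervised SGD, no pre-training.
# Random tensors stand in for MNIST images/labels; sizes are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 784)                      # stand-in for 28x28 pixel inputs
y = torch.randint(0, 10, (256,))               # stand-in digit labels

model = nn.Sequential(                         # three sigmoid hidden layers
    nn.Linear(784, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 10),                        # scores for the 10 digit classes
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)                # supervised loss
    loss.backward()                            # gradient by backpropagation
    opt.step()                                 # small step opposite the gradient
```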
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task

[Bar chart: % error on MNIST (y-axis 1.0-2.5) for each training approach; lower is better]

6
Training Idea #1: No pre-training
• Idea #1: (Just like a shallow network)
– Compute the supervised gradient by backpropagation.
– Take small steps opposite the gradient (SGD).

• What goes wrong?
A. Gets stuck in local optima
• Nonconvex objective
• Usually start at a random (bad) point in parameter space
B. Gradient is progressively getting more dilute
• "Vanishing gradients"

8
Problem A: Nonconvexity
• Where does the nonconvexity come from?
• Even a simple quadratic objective z = xy is nonconvex:

[3-D surface plot of z = xy: a saddle]
9
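A quick way to see that z = xy is nonconvex: a convex function has a positive semidefinite Hessian everywhere, but the Hessian of z = xy has eigenvalues of both signs, so the surface is a saddle. A small check (plain NumPy, not from the slides):

```python
# Hessian of z = x*y is [[0, 1], [1, 0]]; its eigenvalues have mixed signs,
# so z = x*y is neither convex nor concave (it is a saddle surface).
import numpy as np

H = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(np.linalg.eigvalsh(H))   # [-1.  1.] -> indefinite Hessian, nonconvex objective
```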
Problem A: Nonconvexity
• Stochastic Gradient Descent…
– …climbs to the top of the nearest hill…
– …which might not lead to the top of the mountain

15
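The local-optimum problem in one dimension (a toy example, not from the slides): plain gradient descent settles into whichever basin it starts in, which may be far worse than the best one.

```python
# Gradient descent on a 1-D nonconvex function: the final point depends entirely
# on the starting point -- it reaches the nearest optimum, not the best one.
def f(x):  return x**4 - 3 * x**2 + x       # two local minima of different quality
def df(x): return 4 * x**3 - 6 * x + 1      # its derivative

for x0 in (-2.0, 2.0):                      # two different initializations
    x = x0
    for _ in range(1000):
        x -= 0.01 * df(x)                   # small steps opposite the gradient
    print(f"start {x0:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")
```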
Problem B: Vanishing Gradients
• The gradient for an edge at the base of the network depends on the gradients of many edges above it.
• The chain rule multiplies many of these partial derivatives together.

[Diagram: Input x1…xM → Hidden Layer a1…aD → Hidden Layer b1…bE → Hidden Layer c1…cF → Output y, with example per-edge gradient factors 0.7, 0.2, 0.3, 0.1 from input to output]

18
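Multiplying the per-layer factors from the slide shows how quickly the signal shrinks; by the bottom layer the gradient is already far smaller than any single factor.

```python
# Chain rule along the slide's example path: 0.1 * 0.3 * 0.2 * 0.7 = 0.0042.
# Each extra layer multiplies in another factor, so the gradient at the bottom
# of a deep network can become vanishingly small ("vanishing gradients").
import math

factors = [0.1, 0.3, 0.2, 0.7]              # per-edge partial derivatives (from the slide)
print(math.prod(factors))                   # approximately 0.0042
```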
Training Idea #2: Supervised Pre-training
• Idea #2: (Two Steps)
– Train each level of the model in a greedy way
– Then use our original idea

1. Supervised Pre-training
– Use labeled data
– Work bottom-up
• Train hidden layer 1. Then fix its parameters.
• Train hidden layer 2. Then fix its parameters.
• …
• Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
– Use labeled data to train following "Idea #1"
– Refine the features by backpropagation so that they become tuned to the end-task
20
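A sketch of Idea #2, assuming PyTorch and random stand-in data (the layer sizes, learning rate, and the temporary softmax "head" used to train each new layer are illustrative assumptions, not prescribed by the slides):

```python
# Idea #2 sketch: supervised layer-wise pre-training, then supervised fine-tuning.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 784)                      # stand-in inputs
y = torch.randint(0, 10, (256,))               # stand-in labels
loss_fn = nn.CrossEntropyLoss()

def fit(module, params, steps=100, lr=0.1):
    """Run a few SGD steps on the given parameters only."""
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(module(X), y).backward()
        opt.step()

sizes = [784, 256, 256, 256]
layers = []                                    # hidden layers pre-trained so far
for i in range(len(sizes) - 1):
    layer = nn.Sequential(nn.Linear(sizes[i], sizes[i + 1]), nn.Sigmoid())
    head = nn.Linear(sizes[i + 1], 10)         # temporary output layer (discarded later)
    layers.append(layer)
    # 1. Supervised pre-training: only the newest layer and its head are updated
    fit(nn.Sequential(*layers, head),
        list(layer.parameters()) + list(head.parameters()))
    for p in layer.parameters():               # then fix this layer's parameters
        p.requires_grad_(False)

# 2. Supervised fine-tuning: unfreeze everything and train end-to-end ("Idea #1")
model = nn.Sequential(*layers, nn.Linear(sizes[-1], 10))
for p in model.parameters():
    p.requires_grad_(True)
fit(model, model.parameters(), steps=200)
```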
Training Idea #2: Supervised Pre-training
• Idea #2: (Two Steps)
– Train each level of the model in a greedy way
– Then use our original idea

[Diagram: Input x1…xM → Hidden Layer 1 (a1…aD) → Hidden Layer 2 (b1…bE) → Hidden Layer 3 (c1…cF) → Output y]

24
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task

[Bar chart: % error on MNIST (y-axis 1.0-2.5) for each training approach; lower is better]

25
Training Idea #3: Unsupervised Pre-training
• Idea #3: (Two Steps)
– Use our original idea, but pick a better starting point
– Train each level of the model in a greedy way

1. Unsupervised Pre-training
– Use unlabeled data
– Work bottom-up
• Train hidden layer 1. Then fix its parameters.
• Train hidden layer 2. Then fix its parameters.
• …
• Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
– Use labeled data to train following "Idea #1"
– Refine the features by backpropagation so that they become tuned to the end-task
27
The solution: Unsupervised pre-training
• Unsupervised pre-training of the first layer:
– What should it predict?
– What else do we observe?
– The input!
• This topology defines an Auto-encoder.

[Diagram: Input x1…xM → Hidden Layer a1…aD → reconstructed "Input" x1'…xM']

29
Auto-Encoders
Key idea: Encourage z to give small reconstruction error:
– x' is the reconstruction of x
– Loss = ||x − DECODER(ENCODER(x))||²
– Train with the same backpropagation algorithm as for a 2-layer neural network, with x_m as both input and output.

[Diagram: Input x1…xM → ENCODER: z = h(Wx) → Hidden Layer a1…aD → DECODER: x' = h(W'z) → reconstructed "Input" x1'…xM']

30
Slide adapted from Raman Arora
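A minimal auto-encoder sketch matching the slide's notation (ENCODER z = h(Wx), DECODER x' = h(W'z)), assuming PyTorch, a sigmoid h, and random stand-in inputs:

```python
# Auto-encoder: reconstruct the input, minimizing ||x - DECODER(ENCODER(x))||^2.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(256, 784)                       # stand-in inputs x in [0, 1]

encoder = nn.Sequential(nn.Linear(784, 256), nn.Sigmoid())   # z  = h(Wx)
decoder = nn.Sequential(nn.Linear(256, 784), nn.Sigmoid())   # x' = h(W'z)
opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.5)

for step in range(200):                        # ordinary backprop for a 2-layer net,
    opt.zero_grad()                            # with x as both input and target
    x_hat = decoder(encoder(X))
    loss = ((x_hat - X) ** 2).mean()           # squared reconstruction error
    loss.backward()
    opt.step()
```

In practice the decoder weights are sometimes tied to the encoder's (W' = Wᵀ), but the slides do not require this.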
The solution: Unsupervised pre-training
• Work bottom-up
– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.

[Diagrams: each hidden layer in turn is trained as an auto-encoder on its own input and then fixed: layer 1 (a1…aD) reconstructs x1…xM, layer 2 (b1…bF) reconstructs the layer-1 activations a1…aD, layer 3 (c1…cF) reconstructs the layer-2 activations b1…bF]
33
The solution: Unsupervised pre-training
• Unsupervised pre-training: work bottom-up
– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.
• Supervised fine-tuning: backprop and update all parameters

[Diagram: Input x1…xM → Hidden Layer a1…aD → Hidden Layer b1…bF → Hidden Layer c1…cF → Output y]

34
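Putting slides 27 through 34 together: a sketch of unsupervised layer-wise pre-training with auto-encoders followed by supervised fine-tuning, again assuming PyTorch, sigmoid layers, and random stand-in data (all sizes and step counts are illustrative):

```python
# Idea #3 sketch: greedy unsupervised pre-training (stacked auto-encoders),
# then supervised fine-tuning of the whole network.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.rand(256, 784)                       # inputs (used unlabeled for pre-training)
y = torch.randint(0, 10, (256,))               # labels (used only for fine-tuning)

def sgd(loss_closure, params, steps=100, lr=0.5):
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_closure().backward()
        opt.step()

sizes = [784, 256, 256, 256]
encoders, H = [], X                            # H: input to the layer being trained
for i in range(len(sizes) - 1):
    enc = nn.Sequential(nn.Linear(sizes[i], sizes[i + 1]), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(sizes[i + 1], sizes[i]), nn.Sigmoid())  # thrown away
    # 1. Unsupervised pre-training: auto-encode this layer's input (no labels needed)
    sgd(lambda: F.mse_loss(dec(enc(H)), H),
        list(enc.parameters()) + list(dec.parameters()))
    encoders.append(enc)
    H = enc(H).detach()                        # fix the layer; feed its codes upward

# 2. Supervised fine-tuning: add an output layer, backprop and update all parameters
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
sgd(lambda: F.cross_entropy(model(X), y), model.parameters(), steps=200, lr=0.1)
```

The fine-tuning step is exactly Idea #1 run from the pre-trained starting point, which is the sense in which pre-training "picks a better starting point" for the same optimization.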
Deep Network Training
• Idea #1:
1. Supervised fine-tuning only

• Idea #2:
1. Supervised layer-wise pre-training
2. Supervised fine-tuning

• Idea #3:
1. Unsupervised layer-wise pre-training
2. Supervised fine-tuning

35
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task

[Bar chart: % error on MNIST (y-axis 1.0-2.5) for each training approach; lower is better]

36
Is layer-wise pre-training always necessary?
• In 2010, a record on a handwriting recognition task was set by standard supervised backpropagation (our Idea #1).
• How? A very fast implementation on GPUs.
• See Ciresan et al. (2010).


38
Deep Learning
• Goal: learn features at different levels of
abstraction
• Training can be tricky due to…
– Nonconvexity
– Vanishing gradients
• Unsupervised layer-wise pre-training can
help with both!

39
