Greedy Layerwise Learning

Chapter 4: Deep Neural Networks
Topic: Hyperparameters and Greedy Layerwise Learning

Readings:
Nielsen (online book), Neural Networks and Deep Learning
Outline

Part I
• Deep Neural Networks (DNNs)
  – Three ideas for training a DNN
  – Experiments: MNIST digit classification
  – Autoencoders
  – Pretraining
• Convolutional Neural Networks (CNNs)
  – Convolutional layers
  – Pooling layers
  – Image recognition

Part II
• Recurrent Neural Networks (RNNs)
  – Bidirectional RNNs
  – Deep Bidirectional RNNs
  – Deep Bidirectional LSTMs
  – Connection to forward-backward algorithm
PRE-TRAINING FOR DEEP NETS
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Goals for Today's Lecture
1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training
Training Idea #1: No pre-training

Idea #1: (Just like a shallow network)
Compute the supervised gradient by backpropagation.
Take small steps opposite the gradient (SGD).
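To make Idea #1 concrete, the following is a minimal sketch (not from the lecture) of end-to-end training with backpropagation and SGD in plain numpy. The architecture, layer sizes, learning rate, and toy data are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: M inputs, two hidden layers, one sigmoid output.
M, H1, H2 = 8, 16, 16
W1 = rng.normal(0, 0.1, (H1, M));  b1 = np.zeros(H1)
W2 = rng.normal(0, 0.1, (H2, H1)); b2 = np.zeros(H2)
W3 = rng.normal(0, 0.1, (1, H2));  b3 = np.zeros(1)

# Toy labeled data (a stand-in for MNIST).
X = rng.normal(size=(500, M))
y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(float)

lr = 0.1
for epoch in range(20):
    for x_i, y_i in zip(X, y):
        # Forward pass.
        a1  = sigmoid(W1 @ x_i + b1)
        a2  = sigmoid(W2 @ a1 + b2)
        out = sigmoid(W3 @ a2 + b3)[0]

        # Backward pass for the squared error (out - y)^2.
        d3 = 2 * (out - y_i) * out * (1 - out)   # gradient at the output pre-activation
        d2 = (W3[0] * d3) * a2 * (1 - a2)        # gradient at hidden layer 2
        d1 = (W2.T @ d2) * a1 * (1 - a1)         # gradient at hidden layer 1

        # SGD: small steps opposite the gradient.
        W3 -= lr * np.outer([d3], a2); b3 -= lr * d3
        W2 -= lr * np.outer(d2, a1);   b2 -= lr * d2
        W1 -= lr * np.outer(d1, x_i);  b1 -= lr * d1
```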
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Figure: bar chart of % error for each training strategy; y-axis ticks at 1.0, 1.5, 2.0, 2.5]
Problem A: Nonconvexity
• The training objective of a deep network is nonconvex in its parameters.
• Stochastic Gradient Descent can therefore end up in different local optima depending on where it starts: different initializations follow different descent paths.
[Figure: a nonconvex curve y(x) with several local minima; successive frames show SGD runs from different starting points settling into different minima]
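As a tiny illustration of Problem A (my own example, not from the slides), gradient descent on a simple nonconvex one-dimensional function lands in different minima depending on its starting point. The function, step size, and starting points are arbitrary choices for the demo.

```python
# A simple nonconvex objective with two minima (at w = -1 and w = +1).
def f(w):
    return (w**2 - 1.0)**2

def grad_f(w):
    return 4.0 * w * (w**2 - 1.0)

def descend(w0, lr=0.05, steps=200):
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)        # small steps opposite the gradient
    return w

# Two different initializations converge to two different minima.
print(descend(-0.5))   # close to -1.0
print(descend(+0.5))   # close to +1.0
```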
Problem B: Vanishing Gradients
• By the chain rule, the gradient for a weight in an early layer is obtained by multiplying many of these partial derivatives together (the figure annotates example values such as 0.1, 0.2, and 0.7 along the path from the output y back to the inputs x1 x2 x3 … xM).
• A product of many factors smaller than one rapidly approaches zero, so the earliest layers receive a vanishingly small gradient and learn very slowly.
[Figure: a deep network from inputs x1 … xM up to output y, with small partial-derivative values marked on the backpropagation path]
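The arithmetic behind this is easy to check. The sketch below (depth and per-layer values are assumptions chosen to match the 0.1/0.2/0.7-style numbers on the slide) multiplies a chain of per-layer factors and prints how quickly the gradient scale collapses.

```python
import numpy as np

# Each layer contributes a factor |dh_l / dh_{l-1}| to the chain rule.
# For sigmoid activations the derivative is at most 0.25, so these
# factors are often well below one.
per_layer_factors = np.array([0.1, 0.2, 0.7, 0.3, 0.5] * 4)   # a 20-layer chain

gradient_scale = 1.0
for depth, factor in enumerate(per_layer_factors, start=1):
    gradient_scale *= factor
    print(f"depth {depth:2d}: gradient scale ~ {gradient_scale:.1e}")

# After 20 layers the scale is about 2e-11: the earliest layers barely move.
```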
Training Idea #2: Supervised Pre-training

Idea #2: (Two Steps)
Train each level of the model in a greedy way, then use our original idea.
1. Supervised Pre-training
   – Use labeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task
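Here is a minimal numpy sketch of the two steps of Idea #2, assuming sigmoid units and a squared-error loss. The helper name `train_layer_supervised`, the layer sizes, and the toy labeled data are hypothetical; the point is the control flow: train one hidden layer at a time with a throwaway output layer, fix it, hand its features to the next layer, and finally fine-tune everything with ordinary backpropagation.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy labeled data (a stand-in for MNIST).
X = rng.normal(size=(500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def train_layer_supervised(H_in, hidden, epochs=10, lr=0.1):
    """Train one hidden layer (plus a temporary output layer) on labeled data,
    with every previously trained layer below it held fixed."""
    W = rng.normal(0, 0.1, (H_in.shape[1], hidden))   # this layer's weights
    V = rng.normal(0, 0.1, hidden)                    # throwaway output weights
    for _ in range(epochs):
        for h, y_i in zip(H_in, y):
            z = sigmoid(h @ W)
            out = sigmoid(z @ V)
            d_out = 2 * (out - y_i) * out * (1 - out)
            d_z = d_out * V * z * (1 - z)
            V -= lr * d_out * z
            W -= lr * np.outer(h, d_z)
    return W                                          # the output layer V is discarded

# Step 1: supervised pre-training, bottom-up, fixing each layer once trained.
H, layers = X, []
for hidden in [16, 16]:                               # layer sizes assumed
    W = train_layer_supervised(H, hidden)
    layers.append(W)
    H = sigmoid(H @ W)                                # fixed features feed the next layer

# Step 2: supervised fine-tuning would now run Idea #1 (full backpropagation on
# the labeled data), initializing the hidden layers from `layers` rather than
# from random weights.
```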
Training Idea #2: Supervised Pre-training (illustration)
[Figure sequence: the network is grown layer by layer. First, hidden layer 1 (a1 a2 … aD) is trained on the input (x1 x2 x3 … xM) with a temporary output layer y, then fixed; next, hidden layer 2 (b1 b2 … bE) is trained on top of the fixed hidden layer 1 in the same way; and so on.]
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Figure: bar chart of % error for each training strategy; y-axis ticks at 1.0, 1.5, 2.0, 2.5]
Training Idea #3: Unsupervised Pre-training

Idea #3: (Two Steps)
Train each level of the model in a greedy way, then use our original idea, now starting from a better starting point.
1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task
The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
[Figure: the first hidden layer (a1 a2 … aD) sits above the input; instead of predicting the label y, its temporary output layer predicts a reconstruction of the input, "Input" x1' x2' x3' … xM']
Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
– x' is the reconstruction of x
– Loss = || x – DECODER(ENCODER(x)) ||²
– Train with the same backpropagation algorithm as for 2-layer neural networks, with x_m serving as both input and output.

ENCODER: z = h(Wx)
DECODER: x' = h(W'z)
[Figure: input x1 x2 x3 … xM is encoded into the hidden layer a1 a2 … aD, then decoded back into a reconstruction x' of the input]

Slide adapted from Raman Arora
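A minimal numpy sketch of the auto-encoder above: ENCODER z = h(Wx), DECODER x' = h(W'z), trained by backpropagation to shrink the reconstruction error ||x − x'||². The sizes, learning rate, and uniform toy data are assumptions, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # the nonlinearity h

M, D = 8, 4                                    # input size M, hidden size D (assumed)
W  = rng.normal(0, 0.1, (D, M))                # ENCODER weights W
Wp = rng.normal(0, 0.1, (M, D))                # DECODER weights W'

X = rng.uniform(size=(500, M))                 # unlabeled data, e.g. pixel values in [0, 1]

lr = 0.5
for epoch in range(50):
    for x in X:
        z     = sigmoid(W @ x)                 # ENCODER: z = h(W x)
        x_rec = sigmoid(Wp @ z)                # DECODER: x' = h(W' z)

        # Backpropagate the reconstruction loss || x - x' ||^2.
        d_rec = 2 * (x_rec - x) * x_rec * (1 - x_rec)
        d_z   = (Wp.T @ d_rec) * z * (1 - z)

        Wp -= lr * np.outer(d_rec, z)
        W  -= lr * np.outer(d_z, x)
```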
The solution: Unsupervised pre-training

Unsupervised pre-training:
• Work bottom-up
  – Train hidden layer 1 as an auto-encoder that reconstructs the input ("Input" x1' x2' x3' … xM'). Then fix its parameters.
  – Train hidden layer 2 (b1 b2 … bF) as an auto-encoder that reconstructs the activations of the layer below it (a1' a2' … aF'). Then fix its parameters.
  – …
  – Train hidden layer n in the same way (e.g., c1 c2 … cF reconstructing b1' b2' … bF').
[Figure sequence: each newly added hidden layer is trained to reconstruct the output of the fixed layer beneath it]
The solution: Unsupervised pre-training

Unsupervised pre-training:
• Work bottom-up: train each hidden layer in turn, then fix its parameters.
• Finally, place the supervised output layer y on top of the pre-trained stack (hidden layers up through c1 c2 … cF) and fine-tune with labeled data.

Comparing the two pre-training recipes:
Idea #2:
1. Supervised layer-wise pre-training
2. Supervised fine-tuning
Idea #3:
1. Unsupervised layer-wise pre-training
2. Supervised fine-tuning
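Putting Idea #3 together end to end, the sketch below (layer sizes, data, and the helper `train_autoencoder` are all assumptions in the spirit of the auto-encoder sketch above) shows the control flow: greedily pre-train a stack of auto-encoders on unlabeled data, then hand the resulting weights to supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(H, hidden, epochs=20, lr=0.5):
    """Greedy step: learn an encoder for H by reconstructing H itself."""
    W  = rng.normal(0, 0.1, (hidden, H.shape[1]))     # encoder weights
    Wp = rng.normal(0, 0.1, (H.shape[1], hidden))     # decoder weights
    for _ in range(epochs):
        for h in H:
            z     = sigmoid(W @ h)
            h_rec = sigmoid(Wp @ z)
            d_rec = 2 * (h_rec - h) * h_rec * (1 - h_rec)
            d_z   = (Wp.T @ d_rec) * z * (1 - z)
            Wp   -= lr * np.outer(d_rec, z)
            W    -= lr * np.outer(d_z, h)
    return W                                          # the decoder is discarded

# Unlabeled data (a stand-in for MNIST images scaled to [0, 1]).
X_unlabeled = rng.uniform(size=(500, 8))

# Step 1: unsupervised layer-wise pre-training, bottom-up.
H, stack = X_unlabeled, []
for hidden in [16, 8]:                                # layer sizes assumed
    W = train_autoencoder(H, hidden)
    stack.append(W)
    H = sigmoid(H @ W.T)                              # fixed features feed the next layer

# Step 2: supervised fine-tuning (Idea #1) would now initialize the network's
# hidden layers from `stack`, add an output layer, and train all layers jointly
# on the labeled data.
```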
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Figure: bar chart of % error for each training strategy; y-axis ticks at 1.0, 1.5, 2.0, 2.5]
Is layer-wise pre-training always necessary?