Greedy Layerwise Learning

Chapter 4: Deep Neural Networks
Topic: Hyperparameters and Greedy Layerwise Learning

Readings:
Nielsen (online book), Neural Networks and Deep Learning
Outline

Part I
• Deep Neural Networks (DNNs)
  – Three ideas for training a DNN
  – Experiments: MNIST digit classification
  – Autoencoders
  – Pretraining
• Convolutional Neural Networks (CNNs)
  – Convolutional layers
  – Pooling layers
  – Image recognition

Part II
• Recurrent Neural Networks (RNNs)
  – Bidirectional RNNs
  – Deep Bidirectional RNNs
  – Deep Bidirectional LSTMs
  – Connection to forward-backward algorithm
PRE-TRAINING FOR DEEP NETS
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Goals for Today's Lecture
1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training
Training Idea #1: No pre-training

Idea #1: (Just like a shallow network)
Compute the supervised gradient by backpropagation.
Take small steps opposite the gradient (SGD).
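To make Idea #1 concrete, the following is a minimal sketch (not from the lecture) of end-to-end training with backpropagation and SGD in plain numpy. The architecture, layer sizes, learning rate, and toy data are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: M inputs, two hidden layers, one sigmoid output.
M, H1, H2 = 8, 16, 16
W1 = rng.normal(0, 0.1, (H1, M));  b1 = np.zeros(H1)
W2 = rng.normal(0, 0.1, (H2, H1)); b2 = np.zeros(H2)
W3 = rng.normal(0, 0.1, (1, H2));  b3 = np.zeros(1)

# Toy labeled data (a stand-in for MNIST).
X = rng.normal(size=(500, M))
y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(float)

lr = 0.1
for epoch in range(20):
    for x_i, y_i in zip(X, y):
        # Forward pass.
        a1  = sigmoid(W1 @ x_i + b1)
        a2  = sigmoid(W2 @ a1 + b2)
        out = sigmoid(W3 @ a2 + b3)[0]

        # Backward pass for the squared error (out - y)^2.
        d3 = 2 * (out - y_i) * out * (1 - out)   # gradient at the output pre-activation
        d2 = (W3[0] * d3) * a2 * (1 - a2)        # gradient at hidden layer 2
        d1 = (W2.T @ d2) * a1 * (1 - a1)         # gradient at hidden layer 1

        # SGD: small steps opposite the gradient.
        W3 -= lr * np.outer([d3], a2); b3 -= lr * d3
        W2 -= lr * np.outer(d2, a1);   b2 -= lr * d2
        W1 -= lr * np.outer(d1, x_i);  b1 -= lr * d1
```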
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Figure: bar chart of % error for each training strategy; y-axis ticks at 1.0, 1.5, 2.0, 2.5]
Problem A: Nonconvexity
• The training objective of a deep network is nonconvex in its parameters.
• Stochastic Gradient Descent can therefore end up in different local optima depending on where it starts: different initializations follow different descent paths.
[Figure: a nonconvex curve y(x) with several local minima; successive frames show SGD runs from different starting points settling into different minima]
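As a tiny illustration of Problem A (my own example, not from the slides), gradient descent on a simple nonconvex one-dimensional function lands in different minima depending on its starting point. The function, step size, and starting points are arbitrary choices for the demo.

```python
# A simple nonconvex objective with two minima (at w = -1 and w = +1).
def f(w):
    return (w**2 - 1.0)**2

def grad_f(w):
    return 4.0 * w * (w**2 - 1.0)

def descend(w0, lr=0.05, steps=200):
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)        # small steps opposite the gradient
    return w

# Two different initializations converge to two different minima.
print(descend(-0.5))   # close to -1.0
print(descend(+0.5))   # close to +1.0
```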
Problem B: Vanishing Gradients
• By the chain rule, the gradient for a weight in an early layer is obtained by multiplying many of these partial derivatives together (the figure annotates example values such as 0.1, 0.2, and 0.7 along the path from the output y back to the inputs x1 x2 x3 … xM).
• A product of many factors smaller than one rapidly approaches zero, so the earliest layers receive a vanishingly small gradient and learn very slowly.
[Figure: a deep network from inputs x1 … xM up to output y, with small partial-derivative values marked on the backpropagation path]
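The arithmetic behind this is easy to check. The sketch below (depth and per-layer values are assumptions chosen to match the 0.1/0.2/0.7-style numbers on the slide) multiplies a chain of per-layer factors and prints how quickly the gradient scale collapses.

```python
import numpy as np

# Each layer contributes a factor |dh_l / dh_{l-1}| to the chain rule.
# For sigmoid activations the derivative is at most 0.25, so these
# factors are often well below one.
per_layer_factors = np.array([0.1, 0.2, 0.7, 0.3, 0.5] * 4)   # a 20-layer chain

gradient_scale = 1.0
for depth, factor in enumerate(per_layer_factors, start=1):
    gradient_scale *= factor
    print(f"depth {depth:2d}: gradient scale ~ {gradient_scale:.1e}")

# After 20 layers the scale is about 2e-11: the earliest layers barely move.
```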
Training Idea #2: Supervised Pre-training

Idea #2: (Two Steps)
Train each level of the model in a greedy way, then use our original idea.
1. Supervised Pre-training
   – Use labeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task
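Here is a minimal numpy sketch of the two steps of Idea #2, assuming sigmoid units and a squared-error loss. The helper name `train_layer_supervised`, the layer sizes, and the toy labeled data are hypothetical; the point is the control flow: train one hidden layer at a time with a throwaway output layer, fix it, hand its features to the next layer, and finally fine-tune everything with ordinary backpropagation.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy labeled data (a stand-in for MNIST).
X = rng.normal(size=(500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def train_layer_supervised(H_in, hidden, epochs=10, lr=0.1):
    """Train one hidden layer (plus a temporary output layer) on labeled data,
    with every previously trained layer below it held fixed."""
    W = rng.normal(0, 0.1, (H_in.shape[1], hidden))   # this layer's weights
    V = rng.normal(0, 0.1, hidden)                    # throwaway output weights
    for _ in range(epochs):
        for h, y_i in zip(H_in, y):
            z = sigmoid(h @ W)
            out = sigmoid(z @ V)
            d_out = 2 * (out - y_i) * out * (1 - out)
            d_z = d_out * V * z * (1 - z)
            V -= lr * d_out * z
            W -= lr * np.outer(h, d_z)
    return W                                          # the output layer V is discarded

# Step 1: supervised pre-training, bottom-up, fixing each layer once trained.
H, layers = X, []
for hidden in [16, 16]:                               # layer sizes assumed
    W = train_layer_supervised(H, hidden)
    layers.append(W)
    H = sigmoid(H @ W)                                # fixed features feed the next layer

# Step 2: supervised fine-tuning would now run Idea #1 (full backpropagation on
# the labeled data), initializing the hidden layers from `layers` rather than
# from random weights.
```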
Training Idea #2: Supervised Pre-training (illustration)
[Figure sequence: the network is grown layer by layer. First, hidden layer 1 (a1 a2 … aD) is trained on the input (x1 x2 x3 … xM) with a temporary output layer y, then fixed; next, hidden layer 2 (b1 b2 … bE) is trained on top of the fixed hidden layer 1 in the same way; and so on.]
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Figure: bar chart of % error for each training strategy; y-axis ticks at 1.0, 1.5, 2.0, 2.5]
Training Idea #3: Unsupervised Pre-training

Idea #3: (Two Steps)
Train each level of the model in a greedy way, then use our original idea, now starting from a better starting point.
1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task
The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
[Figure: the first hidden layer (a1 a2 … aD) sits above the input; instead of predicting the label y, its temporary output layer predicts a reconstruction of the input, "Input" x1' x2' x3' … xM']
Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
– x' is the reconstruction of x
– Loss = || x – DECODER(ENCODER(x)) ||²
– Train with the same backpropagation algorithm as for 2-layer neural networks, with x_m serving as both input and output.

ENCODER: z = h(Wx)
DECODER: x' = h(W'z)
[Figure: input x1 x2 x3 … xM is encoded into the hidden layer a1 a2 … aD, then decoded back into a reconstruction x' of the input]

Slide adapted from Raman Arora
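A minimal numpy sketch of the auto-encoder above: ENCODER z = h(Wx), DECODER x' = h(W'z), trained by backpropagation to shrink the reconstruction error ||x − x'||². The sizes, learning rate, and uniform toy data are assumptions, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # the nonlinearity h

M, D = 8, 4                                    # input size M, hidden size D (assumed)
W  = rng.normal(0, 0.1, (D, M))                # ENCODER weights W
Wp = rng.normal(0, 0.1, (M, D))                # DECODER weights W'

X = rng.uniform(size=(500, M))                 # unlabeled data, e.g. pixel values in [0, 1]

lr = 0.5
for epoch in range(50):
    for x in X:
        z     = sigmoid(W @ x)                 # ENCODER: z = h(W x)
        x_rec = sigmoid(Wp @ z)                # DECODER: x' = h(W' z)

        # Backpropagate the reconstruction loss || x - x' ||^2.
        d_rec = 2 * (x_rec - x) * x_rec * (1 - x_rec)
        d_z   = (Wp.T @ d_rec) * z * (1 - z)

        Wp -= lr * np.outer(d_rec, z)
        W  -= lr * np.outer(d_z, x)
```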
The solution: Unsupervised pre-training

Unsupervised pre-training:
• Work bottom-up
  – Train hidden layer 1 as an auto-encoder that reconstructs the input ("Input" x1' x2' x3' … xM'). Then fix its parameters.
  – Train hidden layer 2 (b1 b2 … bF) as an auto-encoder that reconstructs the activations of the layer below it (a1' a2' … aF'). Then fix its parameters.
  – …
  – Train hidden layer n in the same way (e.g., c1 c2 … cF reconstructing b1' b2' … bF').
[Figure sequence: each newly added hidden layer is trained to reconstruct the output of the fixed layer beneath it]
The solution: Unsupervised pre-training

Unsupervised pre-training:
• Work bottom-up: train each hidden layer in turn, then fix its parameters.
• Finally, place the supervised output layer y on top of the pre-trained stack (hidden layers up through c1 c2 … cF) and fine-tune with labeled data.

Comparing the two pre-training recipes:
Idea #2:
1. Supervised layer-wise pre-training
2. Supervised fine-tuning
Idea #3:
1. Unsupervised layer-wise pre-training
2. Supervised fine-tuning
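Putting Idea #3 together end to end, the sketch below (layer sizes, data, and the helper `train_autoencoder` are all assumptions in the spirit of the auto-encoder sketch above) shows the control flow: greedily pre-train a stack of auto-encoders on unlabeled data, then hand the resulting weights to supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(H, hidden, epochs=20, lr=0.5):
    """Greedy step: learn an encoder for H by reconstructing H itself."""
    W  = rng.normal(0, 0.1, (hidden, H.shape[1]))     # encoder weights
    Wp = rng.normal(0, 0.1, (H.shape[1], hidden))     # decoder weights
    for _ in range(epochs):
        for h in H:
            z     = sigmoid(W @ h)
            h_rec = sigmoid(Wp @ z)
            d_rec = 2 * (h_rec - h) * h_rec * (1 - h_rec)
            d_z   = (Wp.T @ d_rec) * z * (1 - z)
            Wp   -= lr * np.outer(d_rec, z)
            W    -= lr * np.outer(d_z, h)
    return W                                          # the decoder is discarded

# Unlabeled data (a stand-in for MNIST images scaled to [0, 1]).
X_unlabeled = rng.uniform(size=(500, 8))

# Step 1: unsupervised layer-wise pre-training, bottom-up.
H, stack = X_unlabeled, []
for hidden in [16, 8]:                                # layer sizes assumed
    W = train_autoencoder(H, hidden)
    stack.append(W)
    H = sigmoid(H @ W.T)                              # fixed features feed the next layer

# Step 2: supervised fine-tuning (Idea #1) would now initialize the network's
# hidden layers from `stack`, add an output layer, and train all layers jointly
# on the labeled data.
```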
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Figure: bar chart of % error for each training strategy; y-axis ticks at 1.0, 1.5, 2.0, 2.5]
Is layer-wise pre-training always necessary?