BMM 2018 - Deep Learning Tutorial
General references:
Hertz, Krogh, Palmer 1991
Goodfellow, Bengio, Courville 2016
Supervised learning
Given example input-output pairs (X,Y),
learn to predict output Y from input X
Simple perceptrons can only learn to solve linearly separable problems (Minsky and Papert 1969).
We can solve more complex problems by composing many units in multiple layers.
Multilayer perceptron (MLP)
(“forward propagation”)
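Not from the slides: a minimal NumPy sketch of forward propagation through a two-layer MLP. The layer sizes and the tanh non-linearity are arbitrary choices for illustration.
```python
import numpy as np

def forward(x, params):
    """Forward propagation: input -> hidden layer -> output layer."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)   # hidden-unit activations
    y = W2 @ h + b2            # output (e.g. class scores)
    return y, h

# Toy example: 4-dimensional input, 8 hidden units, 3 outputs.
rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 4)), np.zeros(8),
          rng.normal(size=(3, 8)), np.zeros(3))
y, h = forward(rng.normal(size=4), params)
print(y.shape)  # (3,)
```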
Two motivations for using deep nets instead (see Goodfellow et al 2016, section 6.4.1):
● Statistical: deep nets are compositional, and naturally well suited to representing hierarchical structures where simpler patterns are composed and reused to form more complex ones recursively. It can be argued that many interesting structures in real-world data are like this.
● Computational: under certain conditions, it can be proved that deep architectures are more expressive than shallow ones, i.e. they can learn more patterns for a given total size of the network.
Backpropagation
(Rumelhart, Hinton, Williams 1986)
Problem: compute all partial derivatives of the loss with respect to the weights.
Key insights: the loss depends
● on the weights w of a unit only through that unit’s activation h
● on a unit’s activation h only through the activations of those units that are downstream from h.
These give the gradient of the loss with respect to the weights, which you can then use with your favorite gradient descent method.
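As a hedged illustration (not the slides' code), backpropagation for the two-layer MLP sketched above with a squared-error loss. It follows the two key insights: each weight's gradient flows only through its unit's activation, and each activation receives gradient only from its downstream units.
```python
import numpy as np

def forward_backward(x, target, params):
    W1, b1, W2, b2 = params
    # Forward pass
    a1 = W1 @ x + b1
    h = np.tanh(a1)                     # hidden activation
    y = W2 @ h + b2                     # output
    loss = 0.5 * np.sum((y - target) ** 2)

    # Backward pass: propagate dL/d(activation) from the output back upstream
    dy = y - target                     # dL/dy
    dW2 = np.outer(dy, h)               # loss depends on W2 only through y
    db2 = dy
    dh = W2.T @ dy                      # h gets gradient only from its downstream units
    da1 = dh * (1.0 - h ** 2)           # back through the tanh non-linearity
    dW1 = np.outer(da1, x)              # loss depends on W1 only through h
    db1 = da1
    return loss, (dW1, db1, dW2, db2)
```
The returned gradients can then be plugged into any gradient descent update, e.g. W1 -= learning_rate * dW1.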
Backpropagation - example
● Match low-level vision features (e.g. edges, HOG, SIFT, etc.)
● Parts-based models (Lowe 2004)
Learning the features - inspiration from neuroscience
(LeCun et al 1998)
2D Convolution
[Figure: a kernel / filter slides over a 32×32 input to produce a 28×28 feature map]
Input depth = # of channels in the previous layer (often 3 for the input layer (RGB); can be arbitrary for deeper layers)
Output depth = # of filters (feature maps)
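A minimal sketch of my own (not the slides' code) of 2D convolution, written as the cross-correlation used in deep learning frameworks; it makes the depth bookkeeping explicit: input depth = number of channels, output depth = number of filters.
```python
import numpy as np

def conv2d(x, filters):
    """x: (C_in, H, W); filters: (C_out, C_in, kH, kW) -> output: (C_out, H-kH+1, W-kW+1)."""
    c_out, c_in, kh, kw = filters.shape
    _, h, w = x.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for f in range(c_out):                     # one feature map per filter
        for i in range(h - kh + 1):
            for j in range(w - kw + 1):
                # dot product of the filter with the local patch, across all input channels
                out[f, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * filters[f])
    return out

# 32x32 RGB input convolved with six 5x5 filters -> six 28x28 feature maps
x = np.random.rand(3, 32, 32)
print(conv2d(x, np.random.rand(6, 3, 5, 5)).shape)  # (6, 28, 28)
```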
Convolve with Different Filters
Convolution (with learned filters)
Feature map
Fully Connected vs. Locally Connected
Credit: Ranzato’s CVPR 2014 tutorial
Non-linearity
● Rectified linear function (ReLU)
○ Applied per-pixel, output = max(0, input)
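For example (my sketch), the element-wise ReLU in NumPy:
```python
import numpy as np

def relu(x):
    # Applied per pixel / per element: output = max(0, input)
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.5, 3.0])))  # [0.  0.5 3. ]
```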
[Figure: a test image and the training images with the smallest Euclidean distance to it]
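A sketch of the idea (mine, with hypothetical arrays X_train of flattened training images and y_train of their labels): classify a test image by the label of the training image at the smallest Euclidean distance.
```python
import numpy as np

def nearest_neighbor(test_image, X_train, y_train):
    """X_train: (N, D) flattened training images; y_train: (N,) labels."""
    dists = np.linalg.norm(X_train - test_image.ravel(), axis=1)  # Euclidean distances
    return y_train[np.argmin(dists)]                              # label of the closest image
```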
[Figure: a recurrent net predicts a part-of-speech tag (PRP VBP DT NN) at each time step; the per-step losses sum to the total loss]
Parameter Sharing Across Time
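A minimal vanilla-RNN sketch (mine, with assumed weight shapes) showing that the same weights are reused at every time step:
```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why):
    """xs: list of input vectors, one per time step; the weights are shared across time."""
    h, ys = h0, []
    for x in xs:                           # unroll over time steps t = 1..T
        h = np.tanh(Wxh @ x + Whh @ h)     # same Wxh, Whh at every t
        ys.append(Why @ h)                 # same output weights at every t
    return ys, h                           # per-step outputs; per-step losses sum to the total loss
```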
Vanishing Gradient
[Figure: unrolled chain X → f → f → f → Loss]
● Backpropagating through the chain multiplies many factors, so the gradient can shrink or grow quickly!
○ |.| > 1, gradient explodes → clipping gradients (see the sketch below)
○ |.| < 1, gradient vanishes → introducing memory via LSTMs, GRUs
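A sketch of the first remedy (my own, not the slides' code): rescale the gradients when their global L2 norm exceeds a threshold.
```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```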
● Introducing gates to optionally let information flow through.
○ An LSTM cell has three gates to protect and control the cell state.
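A single LSTM step, sketched by me with assumed weight shapes, showing the three gates (forget, input, output) that protect and control the cell state c:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step. W: (4H, D+H), b: (4H,), where D = input size, H = hidden size."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0 * H:1 * H])      # forget gate: what to erase from the cell state
    i = sigmoid(z[1 * H:2 * H])      # input gate: what new information to write
    o = sigmoid(z[2 * H:3 * H])      # output gate: what to expose as the hidden state
    g = np.tanh(z[3 * H:4 * H])      # candidate cell update
    c = f * c_prev + i * g           # gated update optionally lets information flow through
    h = o * np.tanh(c)
    return h, c
```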
● Learning representations
○ A good representation should preserve the information well.
[Figure: Encoder → Decoder] [LeCun, 1987]
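A toy sketch of the encoder/decoder idea (mine): compress x into a low-dimensional code z, reconstruct it, and train to minimize the reconstruction error so the representation keeps the information.
```python
import numpy as np

def autoencoder(x, W_enc, W_dec):
    z = np.tanh(W_enc @ x)            # encoder: compress x into a latent code z
    x_hat = W_dec @ z                 # decoder: reconstruct x from z
    return x_hat, z

# Train by minimizing the reconstruction error ||x - x_hat||^2 over the data.
```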
Generative Models
● Latent variables z: color, shape, position, ...
● Generate data by sampling z from the prior p(z), then x from p(x|z)
● Idea: approximate p(z|x) with a simpler, tractable q(z|x)
[Figure: decoder p(x|z): z → Decoder → x; encoder q(z|x): x → Encoder → z]
● Learning objective: reconstruction error + KL divergence between q(z|x) and the prior p(z)
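A hedged sketch (mine, assuming a diagonal-Gaussian encoder with outputs mu and logvar) of the VAE objective and the reparameterized sampling of z:
```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO for q(z|x) = N(mu, diag(exp(logvar))) and prior p(z) = N(0, I)."""
    recon = np.sum((x - x_hat) ** 2)                              # reconstruction error
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))     # KL(q(z|x) || p(z))
    return recon + kl

def sample_z(mu, logvar, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```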
● Hands-on session on Monday!