Deep Learning for Computer Vision: Attention Models (UPC 2016)
This document discusses attention models, emphasizing their motivation and applications in image processing and neural machine translation. It covers both soft and hard attention mechanisms, their computational benefits, and their integration into tasks like image captioning and visual question answering. Additionally, it highlights the challenge of differentiability in hard attention and presents spatial transformer networks as a solution for making certain functions trainable.
Attention Models: Motivation
Image:
H x W x 3
bird
The whole input volume is used to predict the output...
...despite the fact that not all pixels are equally important
Encoder & Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
From previous lecture...
The whole input sentence
is used to produce the translation
Attention Models
Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Attention Models
A bird flying over a body of water
Idea: Focus on different parts of the input as you make/refine predictions over time
E.g.: Image Captioning
LSTM Decoder
LSTM LSTM LSTM
CNN LSTM
A bird flying
...
<EOS>
The LSTM decoder “sees” the input only at the beginning!
Features:
D
...
Attention for Image Captioning
CNN
Image:
H x W x 3
Features:
L x D
Slide Credit: CS231n
Attention for Image Captioning
CNN
Image:
H x W x 3
Features:
L x D
h0
a1
Slide Credit: CS231n
Attention weights (L x D)
Attention for Image Captioning
Slide Credit: CS231n
CNN
Image:
H x W x 3
Features:
L x D
h0
a1
z1
Weighted combination of features
y1
h1
First word
Attention weights (L x D)
a2 y2
Weighted features: D
predicted word
Attention for Image Captioning
CNN
Image:
H x W x 3
Features:
L x D
h0
a1
z1
Weighted combination of features
y1
h1
First word
a2 y2
h2
a3 y3
z2 y2
Weighted features: D
predicted word
Attention weights (L x D)
Slide Credit: CS231n
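The loop in these slides (h0 → a1 → z1 → h1 → a2 → ...) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the model from the slides: the dot-product scoring, the tanh update standing in for the LSTM, and the names `softmax`, `attend` and `decode` are all hypothetical, and word prediction is left out. It also assumes one attention weight per location.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, h):
    """Return attention weights over the L locations and the context vector z."""
    # Assumption: score each D-dim feature by its dot product with h (needs len(h) == D).
    scores = [sum(f_k * h_k for f_k, h_k in zip(f, h)) for f in features]
    a = softmax(scores)  # one weight per location, sums to 1
    D = len(features[0])
    z = [sum(a[i] * features[i][k] for i in range(len(features))) for k in range(D)]
    return a, z

def decode(features, h0, steps):
    """Toy decoder loop: h_{t-1} -> a_t -> z_t -> h_t."""
    h, trace = h0, []
    for _ in range(steps):
        a, z = attend(features, h)
        # Stand-in for the LSTM update: mix the previous state with the context.
        h = [math.tanh(h_k + z_k) for h_k, z_k in zip(h, z)]
        trace.append((a, z))
    return trace

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L = 3 locations, D = 2
trace = decode(features, h0=[0.0, 0.0], steps=3)
```

With an all-zero h0 every location gets the same score, so the first step attends uniformly; later steps shift the weights because the hidden state has absorbed the previous context. Unlike the plain LSTM decoder, the input is consulted at every step.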
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
CNN
Image:
H x W x 3
Grid of features (each D-dimensional)
a b
c d
pa
pb
pc
pd
Distribution over grid locations:
pa + pb + pc + pd = 1
Soft attention:
Summarize ALL locations
z = pa·a + pb·b + pc·c + pd·d
Derivative dz/dp is nice!
Train with gradient descent
Context vector z (D-dimensional)
From RNN:
Slide Credit: CS231n
Soft Attention
Differentiable function
Train with gradient descent
● Still uses the whole input!
● Constrained to a fixed grid
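A minimal sketch of the soft-attention summary and of why its derivative is "nice". The feature values, the distribution `p` and the function name `soft_attention` are made up for illustration; the finite-difference check perturbs pa alone, so it measures the partial derivative of z with respect to pa.

```python
def soft_attention(p, grid):
    """Context vector z = sum_i p[i] * grid[i], taken over ALL grid locations."""
    D = len(grid[0])
    return [sum(p_i * f[k] for p_i, f in zip(p, grid)) for k in range(D)]

# Four feature vectors a, b, c, d (D = 2 here) and a distribution over them.
grid = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
p = [0.1, 0.2, 0.3, 0.4]  # pa + pb + pc + pd = 1
z = soft_attention(p, grid)

# z is linear in p, so dz/dpa should simply be the feature vector a.
eps = 1e-6
p_shift = [p[0] + eps] + p[1:]
z_shift = soft_attention(p_shift, grid)
dz_dpa = [(zs - z0) / eps for zs, z0 in zip(z_shift, z)]
```

Here `dz_dpa` comes out approximately equal to the first feature vector, which is why gradients flow back to the weights without any special estimator. The cost is that every location contributes to z, so the whole input is still processed.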
Hard attention
Input image:
H x W x 3
Box Coordinates:
(xc, yc, w, h)
Cropped and rescaled image:
X x Y x 3
Not a differentiable function!
Can’t train with backprop :(
Hard attention:
Sample a subset of the input
Needs reinforcement learning
Gradient is 0 almost everywhere
Gradient is undefined at x = 0
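To see why the gradient is zero almost everywhere, here is a small sketch of a crop-and-rescale with nearest-neighbour lookup. It assumes a single-channel toy image instead of H x W x 3, and the function name `hard_crop` and its parameters are hypothetical.

```python
import math

def hard_crop(image, xc, yc, w, h, out_w, out_h):
    """Crop the box (xc, yc, w, h) and rescale it to out_h x out_w pixels."""
    rows, cols = len(image), len(image[0])
    out = []
    for j in range(out_h):
        row = []
        for i in range(out_w):
            # Map the centre of each output pixel to a source coordinate in the box.
            xs = xc - w / 2 + (i + 0.5) * w / out_w
            ys = yc - h / 2 + (j + 0.5) * h / out_h
            # Snapping to an integer index is the non-differentiable step.
            c = min(max(math.floor(xs), 0), cols - 1)
            r = min(max(math.floor(ys), 0), rows - 1)
            row.append(image[r][c])
        out.append(row)
    return out

image = [[10 * r + c for c in range(6)] for r in range(6)]  # 6 x 6 toy image
crop = hard_crop(image, xc=3.2, yc=3.2, w=4.0, h=4.0, out_w=2, out_h=2)
nudged = hard_crop(image, xc=3.21, yc=3.2, w=4.0, h=4.0, out_w=2, out_h=2)
```

Nudging xc by 0.01 leaves the output unchanged, so the derivative with respect to the box coordinates is zero; it only changes in jumps when a source coordinate crosses a pixel boundary. Backprop therefore gets no useful signal for where to look.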
Hard attention
Gregor et al. DRAW: A Recurrent Neural Network For Image Generation. ICML 2015
Generate images by attending to
arbitrary regions of the output
Classify images by attending to
arbitrary regions of the input
Hard attention
Graves. Generating Sequences with Recurrent Neural Networks. arXiv 2013
Read text, generate handwriting using an RNN that attends to
different arbitrary regions over time
GENERATED
REAL
Hard attention
Input image:
H x W x 3
Box Coordinates:
(xc, yc, w, h)
Cropped and rescaled image:
X x Y x 3
CNN
bird
Not a differentiable function!
Can’t train with backprop :(
Spatial Transformer Networks
Input image:
H x W x 3
Box Coordinates:
(xc, yc, w, h)
Cropped and rescaled image:
X x Y x 3
CNN
bird
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Not a differentiable function!
Can’t train with backprop :(
Make it differentiable
Train with backprop :)
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Input image:
H x W x 3
Box Coordinates:
(xc, yc, w, h)
Cropped and rescaled image:
X x Y x 3
Can we make this function differentiable?
Idea: Function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input
Slide Credit: CS231n
Repeat for all pixels in output to get a sampling grid
Then use bilinear interpolation to compute output
Network attends to input by predicting
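The sampling-grid idea can be sketched in plain Python as follows. The assumptions are loud ones: the mapping is taken to be a 2 x 3 affine matrix `theta` acting on coordinates normalised to [-1, 1], the image is single-channel, and the names `affine_grid` and `bilinear_sample` are chosen for this illustration only.

```python
import math

def affine_grid(theta, out_h, out_w):
    """Map each output pixel (xt, yt) in [-1, 1] to source coordinates (xs, ys)."""
    grid = []
    for j in range(out_h):
        yt = -1.0 + 2.0 * j / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for i in range(out_w):
            xt = -1.0 + 2.0 * i / (out_w - 1) if out_w > 1 else 0.0
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid

def bilinear_sample(image, grid):
    """Read the image at every (xs, ys) in the grid using bilinear interpolation."""
    H, W = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalised coordinates to continuous pixel indices, clamped to the image.
            x = min(max((xs + 1.0) * (W - 1) / 2.0, 0.0), W - 1.0)
            y = min(max((ys + 1.0) * (H - 1) / 2.0, 0.0), H - 1.0)
            x0, y0 = int(math.floor(x)), int(math.floor(y))
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            dx, dy = x - x0, y - y0
            # Weighted average of the four neighbours; it varies smoothly with (xs, ys).
            top = (1 - dx) * image[y0][x0] + dx * image[y0][x1]
            bottom = (1 - dx) * image[y1][x0] + dx * image[y1][x1]
            out_row.append((1 - dy) * top + dy * bottom)
        out.append(out_row)
    return out

image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # attend to the central half of the image
same = bilinear_sample(image, affine_grid(identity, 4, 4))
center = bilinear_sample(image, affine_grid(zoom, 2, 2))
```

The identity matrix reproduces the input, while the `zoom` matrix reads a smaller central window and rescales it to 2 x 2. Because each output value is a weighted average whose weights depend continuously on the sampling coordinates, a small change in `theta` produces a small change in the output, which is what lets gradients reach the parameters.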
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Easy to incorporate in any network, anywhere!
Differentiable module
Insert spatial transformers into a
classification network and it learns
to attend and transform the input
Visual Attention
Sharma et al. Action Recognition Using Visual Attention. arXiv 2016
Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016
Action Recognition in Videos
Salient Object Detection
Other examples
Chen et al. Attention to Scale: Scale-aware Semantic Image Segmentation. CVPR 2016
You et al. Image Captioning with Semantic Attention. CVPR 2016
Attention to scale for semantic segmentation
Semantic attention for image captioning
Resources
● CS231n Lecture @ Stanford [slides] [video]
● More on Reinforcement Learning
● Soft vs Hard attention
● Handwriting generation demo
● Spatial Transformer Networks - Slides & Video by Victor Campos
● Attention implementations:
○ Seq2seq in Keras
○ DRAW & Spatial Transformers in Keras
○ DRAW in Lasagne
○ DRAW in TensorFlow