This document provides an overview of multi-dimensional RNNs, along with related architectural issues and recent results. It begins by contrasting RNNs with feedforward neural networks and introducing LSTM and GRU as solutions to the vanishing gradient problem. It then discusses several generalizations of the basic RNN architecture: directionality with BRNN/BLSTM, dimensionality with MDRNN/MDLSTM, and their combination in MDMDRNN. It also covers hierarchical subsampling with HSRNN. The document concludes by summarizing recent applications of these ideas, such as 2D LSTM for scene labeling, as well as newer architectures including ReNet, PyraMiD-LSTM, and Grid LSTM.