Chapter 12
Deep Neural Networks
(Part II)
LSTM
[Link]. Dương Tuấn Anh
7/2021
Outline
• 1. Long Short-Term Memory (LSTM)
• 2. Applications of deep neural networks
• 3. Conclusions
1. Long Short-Term Memory (LSTM)
Recurrent neural network
A recurrent neural network (RNN) is a kind of artificial neural network
designed to deal with sequential data. An example of a stock price
series is shown in Figure 12.17.
• In a recurrent neural network, the output of a hidden unit at the
current time step is based on the input value at that time step and
the output of the hidden unit at the previous time step. This
enables the network to remember information from previous time
steps.
Figure 12.17
Feed-forward neural network and recurrent neural network
(a)
(b)
Figure 12.18 (a) Feed-forward neural network (b) recurrent neural
network
Recurrent neural network: the output of a hidden unit can serve as the
value of an input unit at the next time step.
Recurrent neural network
• The internal operations of the recurrent cell in an RNN
are illustrated in Figure 12.19.
Figure 12.19 Internal operation of a traditional RNN cell.
Recurrent neural network (cont.)
• Assume that x = (x_1, x_2, …, x_t) represents a data
sequence of length t, and h_t is the value of a hidden
node in the RNN at time step t. The value of a hidden
node (i.e. the information stored in the RNN) is
recomputed at every time step by the following
equation:
h_t = σ(W_x·x_t + W_h·h_{t−1} + b_t)
where σ is the activation function (e.g. the sigmoid function,
the tanh function, or the ReLU (rectified linear unit)
function), W_x and W_h are the adjustable weight vectors,
x_t is the input vector, and b_t is the bias vector.
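The update above can be sketched in NumPy; the dimensions (3 input features, 4 hidden units), the random weights, and the choice of sigmoid as the activation are illustrative assumptions, not values from the text:

```python
import numpy as np

# Sketch of the RNN update h_t = sigma(W_x x_t + W_h h_{t-1} + b_t).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                        # assumed sizes for illustration
W_x = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
b = np.zeros(n_hidden)                       # bias vector

x_seq = rng.normal(size=(5, n_in))           # a sequence of length 5
h = np.zeros(n_hidden)                       # initial hidden state
for x_t in x_seq:
    # The same weights are reused at every time step; h carries
    # information from previous steps into the current update.
    h = sigmoid(W_x @ x_t + W_h @ h + b)
print(h.shape)  # (4,)
```

Note that the recurrence is just a loop: the hidden state computed at one step is fed back in at the next.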
ReLU function
• The ReLU (rectified linear unit) function is defined as follows:
ReLU(x) = max(0, x)
Figure 12.20 ReLU function
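The definition above is elementwise: negative inputs are clipped to 0 and positive inputs pass through unchanged, as this small sketch shows:

```python
import numpy as np

# ReLU(x) = max(0, x), applied elementwise to an array.
def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])).tolist())  # [0.0, 0.0, 0.0, 1.5]
```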
Recurrent neural network (cont.)
• For an RNN with several hidden layers, the training
process using the back-propagation-through-time algorithm
incurs the problem of exploding or vanishing gradients.
• Therefore, the LSTM network was proposed to
overcome these weaknesses of recurrent neural
networks.
• Each LSTM unit consists of a cell state (or cell memory)
and 3 gates.
• A cell state in an LSTM network is equivalent to a hidden
unit in an RNN.
LSTM network
• LSTM is an improved version of RNN, proposed in 1997
by Hochreiter and Schmidhuber to deal with the long-term
dependencies of sequential data.
• The traditional RNN cannot remember the long-term
dependencies among data values in a long sequence;
therefore, the first data value in a sequence does not have
a significant influence on the values predicted at later
time steps.
• An LSTM network consists of several cell states which can
represent time-dependent information, as hidden nodes do in
an RNN.
LSTM cell
• The main block of LSTM is called the cell state. LSTM
can use or ignore the information which flows through
the gates; this is how the information flow within the
LSTM cell is controlled. Each gate uses a sigmoid
activation function with range [0, 1] to represent how much
information is allowed to flow. If the sigmoid
function gives the value 0, no information goes
through; if the function gives the value 1, all
the information is allowed to go through.
• Each LSTM cell consists of 3 gates to control and
protect the cell state: the forget gate, the input gate and
the output gate.
LSTM cell
Figure 12.21 LSTM cell (block)
Forget gate
• The forget gate controls which elements of the cell
state vector will be forgotten.
• In Equation (1), f_t is the resulting vector of the forget gate
at the current time step, σ is the sigmoid function,
C_{t−1} and h_{t−1} represent the cell state and the output of
the hidden unit at the previous time step, and W_f and b_f
represent the weight vector and the bias vector from
the input layer to the forget gate.
f_t = σ(W_f·[C_{t−1}, h_{t−1}, x_t] + b_f)   (1)
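Equation (1) can be sketched in NumPy; [·,·,·] is read here as vector concatenation, and all dimensions and random weights are illustrative assumptions:

```python
import numpy as np

# Sketch of the forget gate f_t = sigma(W_f . [C_{t-1}, h_{t-1}, x_t] + b_f).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4                       # assumed sizes for illustration
C_prev = rng.normal(size=n_hidden)          # previous cell state C_{t-1}
h_prev = rng.normal(size=n_hidden)          # previous hidden output h_{t-1}
x_t = rng.normal(size=n_in)                 # current input x_t

z = np.concatenate([C_prev, h_prev, x_t])   # [C_{t-1}, h_{t-1}, x_t]
W_f = rng.normal(size=(n_hidden, z.size))   # forget-gate weight vector
b_f = np.zeros(n_hidden)                    # forget-gate bias vector

# Each entry of f_t lies in (0, 1): near 0 = forget that element of the
# cell state, near 1 = keep it.
f_t = sigmoid(W_f @ z + b_f)
print(f_t.shape)  # (4,)
```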
Input gate
• The input gate controls the information which should enter
the cell state. This gate has two parts: a sigmoid layer and a
tanh layer. The sigmoid layer selects the information
from h_{t−1}, x_t and C_{t−1}. The tanh layer generates the
candidate values which are added to the cell memory. The
output values of the sigmoid layer and the tanh layer are
computed as follows:
i_t = σ(W_i·[C_{t−1}, h_{t−1}, x_t] + b_i)   (2)
Ĉ_t = tanh(W_c·[h_{t−1}, x_t] + b_c)   (3)
where i_t is the output value of the input gate; W_i and b_i
represent the weight vector and bias vector of the input gate. W_c and
b_c in (3) represent the weight vector and bias vector of the cell
state.
Cell state
• The cell state at the previous time step, denoted as
C_{t−1}, is updated to become C_t.
• This is done by multiplying the cell state at the previous
time step C_{t−1} with f_t and adding i_t * Ĉ_t, which
becomes the new information that needs to be stored.
This step is described by the following equation:
C_t = f_t * C_{t−1} + i_t * Ĉ_t   (4)
Output gate
• The output gate determines which information from
the cell state is chosen to be the output value of
the LSTM cell.
• In Equations (5) and (6), o_t is the output value of
the output gate, W_o and b_o represent the weight vector
and bias vector of the output gate, and
h_t is the output of the hidden layer at the current time
step.
o_t = σ(W_o·[C_t, h_{t−1}, x_t] + b_o)   (5)
h_t = o_t * tanh(C_t)   (6)
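Putting Equations (1)–(6) together, one forward step of an LSTM cell can be sketched as follows. This is a minimal NumPy sketch under assumed dimensions and random weights; it follows the slides' formulation, in which C_{t−1} (or C_t) appears inside the gate inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    zc = np.concatenate([C_prev, h_prev, x_t])  # [C_{t-1}, h_{t-1}, x_t]
    zh = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ zc + b_f)               # (1) forget gate
    i_t = sigmoid(W_i @ zc + b_i)               # (2) input gate
    C_hat = np.tanh(W_c @ zh + b_c)             # (3) candidate values
    C_t = f_t * C_prev + i_t * C_hat            # (4) cell-state update
    zo = np.concatenate([C_t, h_prev, x_t])     # [C_t, h_{t-1}, x_t]
    o_t = sigmoid(W_o @ zo + b_o)               # (5) output gate
    h_t = o_t * np.tanh(C_t)                    # (6) hidden output
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_h = 3, 4                                # assumed sizes
W_f = rng.normal(size=(n_h, 2 * n_h + n_in)); b_f = np.zeros(n_h)
W_i = rng.normal(size=(n_h, 2 * n_h + n_in)); b_i = np.zeros(n_h)
W_c = rng.normal(size=(n_h, n_h + n_in));     b_c = np.zeros(n_h)
W_o = rng.normal(size=(n_h, 2 * n_h + n_in)); b_o = np.zeros(n_h)

h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):          # run over a short sequence
    h, C = lstm_step(x_t, h, C, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o)
print(h.shape, C.shape)  # (4,) (4,)
```

Unlike the plain RNN, two quantities are carried between time steps: the hidden output h_t and the cell state C_t, and the gates decide what is forgotten, added, and emitted.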
Architecture of LSTM network
• An LSTM network can contain more than one hidden layer. It
consists of several layers, each made up of a number
of LSTM cells, such that the output values of the previous layer
become the input values for the next layer.
• The architecture of an LSTM network is illustrated in Figure
12.22.
Architecture of LSTM network
Figure 12.22 Architecture of LSTM network
Applications of LSTM networks
• Robot control
• Time series prediction
• Speech recognition
• Music composition
• Grammar learning
• Natural Language processing
• Handwriting recognition
• Human action recognition
• Sentiment analysis
Time Series Forecasting with LSTM
Figure 12.23: Training LSTM for time
series forecasting
Training LSTM for time series forecasting
• Figure 12.23 illustrates one iteration step in the
training process of the LSTM. A random batch of
input data x consisting of m independent training
samples (depicted by the colours) is used in each
step. Each training sample consists of n data points
and one target value (y_obs) to predict. The loss is
computed from the target value and the network's
prediction y_sim, and is used to update the network
parameters.
EX: A time series: 2 3 5 4 6 8 5 7 11 13 9 7
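The example series above can be turned into (input window, target) training samples with a sliding window; the window length n = 3 here is an assumed illustration, not a value from the slides:

```python
# Build (window of n points, next value to predict) pairs from the series.
series = [2, 3, 5, 4, 6, 8, 5, 7, 11, 13, 9, 7]
n = 3  # assumed window length

samples = [(series[i:i + n], series[i + n]) for i in range(len(series) - n)]
for x, y_obs in samples[:3]:
    print(x, "->", y_obs)
# [2, 3, 5] -> 4
# [3, 5, 4] -> 6
# [5, 4, 6] -> 8
```

Each pair is one training sample: the network reads the n points of the window and is trained so that its prediction y_sim matches the target y_obs.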
2. Applications of Deep Neural
Networks
Applications:
• Speech and audio: speech recognition, audio
and music processing.
• Image and video: image recognition, computer
vision.
• Language modeling: machine translation, text
information retrieval.
• Time series prediction
3. Conclusions
• Building/learning deep architectures and hierarchies of
features is highly desirable.
• CNN is suitable for image data and LSTM is suitable for
sequential data.
• Deep learning is an emerging technology. Despite the
promising results reported so far, much remains to be developed.
• The current optimization techniques for learning deep
architectures should be improved.
• To make deep learning techniques scalable to very large
training data, sound parallel learning algorithms or more
effective architectures than the existing ones need to be
developed.
• How to choose sensible values for hyper-parameters, such as
the learning rate schedule, the number of layers and the number
of units per layer, remains an open question.
Terminology
• Recurrent neural network: mạng nơ ron hồi quy,
sequential series: chuỗi tuần tự, vanishing gradient: độ
dốc triệt tiêu, exploding gradient: độ dốc bùng nổ, long-
term dependency: sự phụ thuộc dài hạn, LSTM block:
khối LSTM, LSTM cell: tế bào LSTM, cell state: trạng
thái tế bào, forget gate: cổng quên, input gate: cổng
nhập, output gate: cổng xuất, sigmoid function: hàm
sigmoid, tanh function: hàm tanh, time series
prediction: dự báo chuỗi thời gian, sliding window: cửa
sổ trượt, input vector: véc tơ đầu vào, epoch: kỷ
nguyên