Reinforced Mnemonic Reader for Machine Reading Comprehension

Minghao Hu†∗, Yuxing Peng†, Zhen Huang†, Xipeng Qiu‡, Furu Wei§, Ming Zhou§

† College of Computer, National University of Defense Technology, Changsha, China
‡ School of Computer Science, Fudan University, Shanghai, China
§ Microsoft Research, Beijing, China
{huminghao09,pengyuhang,huangzhen}@[Link]
xpqiu@[Link], {fuwei,mingzhou}@[Link]

arXiv:1705.02798v6 [Link] 6 Jun 2018

∗ Contribution during internship at Fudan University and Microsoft Research.

Abstract

In this paper, we introduce the Reinforced Mnemonic Reader for machine reading comprehension tasks, which enhances previous attentive readers in two aspects. First, a reattention mechanism is proposed to refine current attentions by directly accessing past attentions that are temporally memorized in a multi-round alignment architecture, so as to avoid the problems of attention redundancy and attention deficiency. Second, a new optimization approach, called dynamic-critical reinforcement learning, is introduced to extend the standard supervised method. It always encourages the prediction of a more acceptable answer, so as to address the convergence suppression problem that occurs in traditional reinforcement learning algorithms. Extensive experiments on the Stanford Question Answering Dataset (SQuAD) show that our model achieves state-of-the-art results. Meanwhile, our model outperforms previous systems by over 6% in terms of both Exact Match and F1 metrics on two adversarial SQuAD datasets.

1 Introduction

Teaching machines to comprehend a given context paragraph and answer corresponding questions is one of the long-term goals of natural language processing and artificial intelligence. Figure 1 gives an example of the machine reading comprehension (MRC) task. Benefiting from the rapid development of deep learning techniques [Goodfellow et al., 2016] and large-scale benchmark datasets [Hermann et al., 2015; Hill et al., 2016; Rajpurkar et al., 2016], end-to-end neural networks have achieved promising results on this task [Wang et al., 2017; Seo et al., 2017; Xiong et al., 2017a; Huang et al., 2017].

Despite these advancements, we argue that two limitations still exist:

1. To capture complex interactions between the context and the question, a variety of neural attention mechanisms [Dzmitry Bahdanau, 2015], such as bi-attention [Seo et al., 2017] and coattention [Xiong et al., 2017b], have been proposed in a single-round alignment architecture. In order to fully compose complete information of the inputs, multi-round alignment architectures that compute attentions repeatedly have been proposed [Huang et al., 2017; Xiong et al., 2017a]. However, in these approaches, the current attention is unaware of which parts of the context and question have been focused on in earlier attentions, which results in two distinct but related issues, where multiple attentions 1) focus on the same texts, leading to attention redundancy, and 2) fail to focus on some salient parts of the input, causing attention deficiency.

2. To train the model, the standard maximum-likelihood method is used for predicting exactly-matched (EM) answer spans [Wang and Jiang, 2017]. Recently, a reinforcement learning algorithm, which measures the reward as the word overlap between the predicted answer and the ground truth, has been introduced to optimize towards the F1 metric instead of the EM metric [Xiong et al., 2017a]. Specifically, an estimated baseline is utilized to normalize the reward and reduce variance. However, convergence can be suppressed when the baseline is better than the reward. This is harmful if the inferior reward partially overlaps with the ground truth, as the normalized objective will discourage the prediction of ground-truth positions. We refer to this case as the convergence suppression problem.

To address the first problem, we present a reattention mechanism that temporally memorizes past attentions and uses them to refine current attentions in a multi-round alignment architecture. The computation is based on the fact that two words should share similar semantics if their attentions over the same texts are highly overlapped, and be less similar vice versa. Therefore, the reattention can be more concentrated if past attentions focus on the same parts of the input, or be relatively more distracted so as to focus on new regions if past attentions do not overlap at all.

As for the second problem, we extend the traditional training method with a novel approach called dynamic-critical reinforcement learning. Unlike the traditional reinforcement learning algorithm, where the reward and baseline are statically sampled, our approach dynamically decides the reward and the baseline according to two sampling strategies, namely random inference and greedy inference.
Context: The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.
Question: Which NFL team represented the AFC at Super Bowl 50?
Answer: Denver Broncos

Figure 1: An example from the SQuAD dataset. Evidences needed for the answer are marked as green.

Model         Aligning Rounds           Attention Type
              Interactive    Self
Match-LSTM1   1              -          Serial
R-net2        1              1          Serial
BiDAF3        1              -          Parallel
FastQAExt4    1              1          Parallel
DCN+5         2              2          Parallel
FusionNet6    3              1          Parallel
Our Model     3              3          Parallel

Table 1: Comparison of alignment architectures of competing models: Wang & Jiang [2017]1, Wang et al. [2017]2, Seo et al. [2017]3, Weissenborn et al. [2017]4, Xiong et al. [2017a]5 and Huang et al. [2017]6.

The result with the higher score is always set to be the reward, while the other is the baseline. In this way, the normalized reward is ensured to be always positive, so that no convergence suppression occurs.

All of the above innovations are integrated into a new end-to-end neural architecture called the Reinforced Mnemonic Reader, shown in Figure 3. We conducted extensive experiments on both the SQuAD dataset [Rajpurkar et al., 2016] and two adversarial SQuAD datasets [Jia and Liang, 2017] to evaluate the proposed model. On SQuAD, our single model obtains an exact match (EM) score of 79.5% and an F1 score of 86.6%, while our ensemble model further boosts the results to 82.3% and 88.5% respectively. On adversarial SQuAD, our model surpasses existing approaches by more than 6% on both the AddSent and AddOneSent datasets.

2 MRC with Reattention

2.1 Task Description

In MRC tasks, a question Q and a context C are given, and our goal is to predict an answer A, which takes different forms depending on the specific task. In the SQuAD dataset [Rajpurkar et al., 2016], the answer A is constrained to be a segment of text in the context C, and neural networks are designed to model the probability distribution p(A|C, Q).

2.2 Alignment Architecture for MRC

Among all state-of-the-art works for MRC, one of the key factors is the alignment architecture. That is, given the hidden representations of question and context, we align each context word with the entire question using attention mechanisms, and enhance the context representation with the attentive question information. A detailed comparison of different alignment architectures is shown in Table 1.

Early work for MRC, such as Match-LSTM [Wang and Jiang, 2017], utilizes the attention mechanism stemming from neural machine translation [Dzmitry Bahdanau, 2015] serially, where the attention is computed inside the cell of a recurrent neural network. A more popular approach is to compute attentions in parallel, resulting in a similarity matrix. Concretely, given two sets of hidden vectors, V = {v_i}_{i=1}^n and U = {u_j}_{j=1}^m, representing question and context respectively, a similarity matrix E ∈ R^{n×m} is computed as

  E_{ij} = f(v_i, u_j)    (1)

where E_{ij} indicates the similarity between the i-th question word and the j-th context word, and f is a scalar function. Different methods have been proposed to normalize the matrix, resulting in variants of attention such as bi-attention [Seo et al., 2017] and coattention [Xiong et al., 2017b]. The attention is then used to attend the question and form a question-aware context representation H = {h_j}_{j=1}^m.

Later, Wang et al. [2017] propose a serial self-aligning method that aligns the context against itself to capture long-term dependencies among context words. Weissenborn et al. [2017] apply the self alignment in a way similar to Eq. 1, yielding another similarity matrix B ∈ R^{m×m} as

  B_{ij} = 1_{{i≠j}} f(h_i, h_j)    (2)

where 1_{·} is an indicator function ensuring that a context word is not aligned with itself. Finally, the attentive information can be integrated to form a self-aware context representation Z = {z_j}_{j=1}^m, which is used to predict the answer.

We refer to the above process as a single-round alignment architecture. Such an architecture, however, is limited in its capability to capture complex interactions between question and context. Therefore, recent works build multi-round alignment architectures by stacking several identical aligning layers [Huang et al., 2017; Xiong et al., 2017a]. More specifically, let V^t = {v^t_i}_{i=1}^n and U^t = {u^t_j}_{j=1}^m denote the hidden representations of question and context in the t-th layer, and let H^t = {h^t_j}_{j=1}^m be the corresponding question-aware context representation. Then the two similarity matrices can be computed as

  E^t_{ij} = f(v^t_i, u^t_j),   B^t_{ij} = 1_{{i≠j}} f(h^t_i, h^t_j)    (3)

However, one problem is that each alignment is not directly aware of previous alignments in such an architecture. The attentive information can only flow to the subsequent layer through the hidden representation. This can cause two problems: 1) attention redundancy, where multiple attention distributions are highly similar. Let softmax(x) denote the softmax function over a vector x. Then this problem can be formalized as D(softmax(E^t_{:j}) ‖ softmax(E^k_{:j})) < σ (t ≠ k), where σ is a small bound and D is a function measuring the distribution distance; 2) attention deficiency, which means that the attention fails to focus on salient parts of the input: D(softmax(E^t_{:j})^* ‖ softmax(E^t_{:j})) > δ, where δ is another bound and softmax(E^t_{:j})^* is the "ground truth" attention distribution.
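As a toy illustration, the single-round alignment of Eqs. 1-2 can be sketched in numpy. All sizes and the dot-product choice for f below are illustrative assumptions, not the paper's actual attention function (which is introduced later, in Section 4):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes: n question words, m context words, hidden size d (hypothetical values).
n, m, d = 3, 5, 4
rng = np.random.default_rng(0)
V = rng.normal(size=(n, d))   # question hidden vectors {v_i}
U = rng.normal(size=(m, d))   # context hidden vectors {u_j}

# Eq. 1: similarity matrix E in R^{n x m}; here f(v, u) is a plain dot product.
E = V @ U.T

# Column-wise softmax gives each context word's attention over the question,
# which is used to build a question-aware context representation H.
A = softmax(E, axis=0)        # A[:, j] = softmax(E_{:j})
H = A.T @ V                   # h_j aggregates question vectors, shape (m, d)

# Eq. 2: self-similarity B, with the diagonal zeroed by the indicator 1_{i != j}
# so that a context word is not aligned with itself.
B = U @ U.T
np.fill_diagonal(B, 0.0)
```

Stacking this computation t times with fresh representations per layer yields exactly the multi-round matrices E^t and B^t of Eq. 3.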
2.3 Reattention Mechanism

To address these problems, we propose to temporally memorize past attentions and explicitly use them to refine current attentions. The intuition is that two words should be correlated if their attentions over the same texts are highly overlapped, and be less related vice versa. For example, in Figure 2, suppose that we have access to previous attentions; then we can compute their dot product to obtain a "similarity of attention". In this case, the similarity of the word pair (team, Broncos) is higher than that of (team, Panthers).

Figure 2: Illustrations of reattention for the example in Figure 1. The dot products of past attention distributions give softmax(E^{t-1}_{i:}) · softmax(B^{t-1}_{:j}) ≈ 0.2 for (team, Broncos) and softmax(E^{t-1}_{i:}) · softmax(B^{t-1}_{:k}) ≈ 0.13 for (team, Panthers).

Therefore, we define the computation of reattention as follows. Let E^{t-1} and B^{t-1} denote the past similarity matrices that are temporally memorized. The refined similarity matrix E^t (t > 1) is computed as

  Ẽ^t_{ij} = softmax(E^{t-1}_{i:}) · softmax(B^{t-1}_{:j})
  E^t_{ij} = f(v^t_i, u^t_j) + γ Ẽ^t_{ij}    (4)

where γ is a trainable parameter. Here, softmax(E^{t-1}_{i:}) is the past context attention distribution for the i-th question word, and softmax(B^{t-1}_{:j}) is the self attention distribution for the j-th context word. In the extreme case, when there is no overlap between the two distributions, the dot product will be 0. On the other hand, if the two distributions are identical and focus on one single word, it will have a maximum value of 1. Therefore, the similarity of two words can be explicitly measured using their past attentions. Since the dot product is relatively small compared to the original similarity, we initialize γ with a tunable hyper-parameter and keep it trainable. The refined similarity matrix can then be normalized for attending the question. Similarly, we can compute the refined matrix B^t to get the unnormalized self reattention as

  B̃^t_{ij} = softmax(B^{t-1}_{i:}) · softmax(B^{t-1}_{:j})
  B^t_{ij} = 1_{{i≠j}} ( f(h^t_i, h^t_j) + γ B̃^t_{ij} )    (5)

3 Dynamic-critical Reinforcement Learning

In the extractive MRC task, the model distribution p(A|C, Q; θ) can be divided into two steps: first predicting the start position i and then the end position j as

  p(A|C, Q; θ) = p_1(i|C, Q; θ) p_2(j|i, C, Q; θ)    (6)

where θ represents all trainable parameters.

The standard maximum-likelihood (ML) training method is to maximize the log probabilities of the ground-truth answer positions [Wang and Jiang, 2017]:

  L_ML(θ) = − Σ_k [ log p_1(y^1_k) + log p_2(y^2_k | y^1_k) ]    (7)

where y^1_k and y^2_k are the answer span for the k-th example, and we denote p_1(i|C, Q; θ) and p_2(j|i, C, Q; θ) as p_1(i) and p_2(j|i) respectively for abbreviation.

Recently, reinforcement learning (RL), with the task reward measured as word overlap between the predicted answer and the ground truth, has been introduced to MRC [Xiong et al., 2017a]. A baseline b, which is obtained by running greedy inference with the current model, is used to normalize the reward and reduce variance. This approach is known as self-critical sequence training (SCST) [Rennie et al., 2016], which was first used in image captioning. More specifically, let R(A^s, A^*) denote the F1 score between a sampled answer A^s and the ground truth A^*. The training objective is to minimize the negative expected reward

  L_SCST(θ) = − E_{A^s ∼ p_θ(A)} [ R(A^s) − R(Â) ]    (8)

where we abbreviate the model distribution p(A|C, Q; θ) as p_θ(A), and the reward function R(A^s, A^*) as R(A^s). Â is obtained by greedily maximizing the model distribution:

  Â = argmax_A p(A|C, Q; θ)

The expected gradient ∇_θ L_SCST(θ) can be computed according to the REINFORCE algorithm [Sutton and Barto, 1998] as

  ∇_θ L_SCST(θ) = − E_{A^s ∼ p_θ(A)} [ (R(A^s) − b) ∇_θ log p_θ(A^s) ]
                ≈ − ( R(A^s) − R(Â) ) ∇_θ log p_θ(A^s)    (9)

where the gradient can be approximated using a single Monte-Carlo sample A^s derived from p_θ.
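The effect of the self-critical baseline in Eq. 9 can be made concrete with a small sketch. The token positions below are hypothetical, and F1 is computed over span positions only, for simplicity; the dynamic reward/baseline choice mirrors the description given in the introduction (formalized in Eq. 10 below):

```python
# Token-level F1 between two answer spans (start, end), used as the reward R.
def f1(pred, gold):
    p = set(range(pred[0], pred[1] + 1))
    g = set(range(gold[0], gold[1] + 1))
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

gold = (4, 5)      # ground truth A*: "Denver Broncos" (hypothetical positions)
sampled = (3, 5)   # A^s: "champion Denver Broncos", an acceptable answer
greedy = (4, 5)    # A_hat: "Denver Broncos", the greedy baseline

# SCST advantage (Eq. 9): negative here, so the acceptable sample is punished
# and the prediction of the overlapping ground-truth positions is discouraged.
scst_advantage = f1(sampled, gold) - f1(greedy, gold)   # 0.8 - 1.0 < 0

# Dynamic-critical choice: the higher-scoring answer becomes the reward and the
# other the baseline, so the normalized reward is never negative.
reward = max(f1(sampled, gold), f1(greedy, gold))
baseline = min(f1(sampled, gold), f1(greedy, gold))
dcrl_advantage = reward - baseline                      # >= 0 by construction
```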
However, a sampled answer is discouraged by the objective when it is worse than the baseline. This is harmful if the answer partially overlaps with the ground truth, since the normalized objective would discourage the prediction of the ground-truth positions. For example, in Figure 1, suppose that A^s is "champion Denver Broncos" and Â is "Denver Broncos". Although the former is an acceptable answer, the normalized reward would be negative and the prediction for the end position would be suppressed, thus hindering convergence. We refer to this case as the convergence suppression problem.

Here, we consider both random inference and greedy inference as two different sampling strategies: the first encourages exploration while the latter is for exploitation.¹

¹ In practice we found that a better approximation can be made by considering a top-K answer list, where Â is the best result and A^s is sampled from the rest of the list.

Therefore, we approximate the expected gradient by dynamically setting the reward and the baseline based on the F1 scores of both A^s and Â. The one with the higher score is set as the reward, while the other is the baseline. We call this approach dynamic-critical reinforcement learning (DCRL):

  ∇_θ L_DCRL(θ) = − E_{A^s ∼ p_θ(A)} [ (R(A^s) − b) ∇_θ log p_θ(A^s) ]
    ≈ − 1_{{R(A^s) ≥ R(Â)}} ( R(A^s) − R(Â) ) ∇_θ log p_θ(A^s)
      − 1_{{R(Â) > R(A^s)}} ( R(Â) − R(A^s) ) ∇_θ log p_θ(Â)    (10)

Notice that the normalized reward is constantly positive, so that superior answers are always encouraged. Besides, when the score of random inference is higher than the greedy one, DCRL is equivalent to SCST. Thus, Eq. 9 is a special case of Eq. 10.

Following [Xiong et al., 2017a] and [Kendall et al., 2017], we combine the ML and DCRL objectives using homoscedastic uncertainty as task-dependent weightings so as to stabilize the RL training:

  L = 1/(2σ_a²) L_ML + 1/(2σ_b²) L_DCRL + log σ_a² + log σ_b²    (11)

where σ_a and σ_b are trainable parameters.

4 End-to-end Architecture

Based on the previous innovations, we introduce an end-to-end architecture called the Reinforced Mnemonic Reader, which is shown in Figure 3. It consists of three main components: 1) an encoder builds contextual representations for question and context jointly; 2) an iterative aligner performs multi-round alignments between question and context with the reattention mechanism; 3) an answer pointer predicts the answer span sequentially. Below we give more details of each component.

Figure 3: The architecture overview of the Reinforced Mnemonic Reader. The subfigures to the right show detailed demonstrations of the reattention mechanism: 1) refined E^t to attend the query; 2) refined B^t to attend the context.

Encoder. Let W^Q = {w^q_i}_{i=1}^n and W^C = {w^c_j}_{j=1}^m denote the word sequences of the question and context respectively. The encoder first converts each word to an input vector. We utilize the 100-dim GloVe embedding [Pennington et al., 2014] and the 1024-dim ELMo embedding [Peters et al., 2018]. Besides, a character-level embedding is obtained by encoding the character sequence with a bi-directional long short-term memory network (BiLSTM) [Hochreiter and Schmidhuber, 1997], where the two last hidden states are concatenated to form the embedding. In addition, we use a binary feature of exact match, a POS embedding and an NER embedding for both question and context, as suggested in [Chen et al., 2017]. Together, the inputs X^Q = {x^q_i}_{i=1}^n and X^C = {x^c_j}_{j=1}^m are obtained.

To model each word with its contextual information, a weight-shared BiLSTM is utilized to perform the encoding:

  v_i = BiLSTM(x^q_i),   u_j = BiLSTM(x^c_j)    (12)

Thus, the contextual representations for both question and context words can be obtained, denoted as two matrices: V = [v_1, ..., v_n] ∈ R^{2d×n} and U = [u_1, ..., u_m] ∈ R^{2d×m}.

Iterative Aligner. The iterative aligner contains a stack of three aligning blocks. Each block consists of three modules: 1) an interactive alignment to attend the question into the context; 2) a self alignment to attend the context against itself; and 3) an evidence collection to model the context representation with a BiLSTM. The reattention mechanism is utilized between two blocks, where past attentions are temporally memorized to help modulate current attentions.
Below we first describe a single block in detail, which is shown in Figure 4, and then introduce the entire architecture.

Figure 4: The detailed overview of a single aligning block, consisting of an interactive alignment, a self alignment, and an evidence collection. Different colors in E and B represent different degrees of similarity.

Single Aligning Block. First, the similarity matrix E ∈ R^{n×m} is computed using Eq. 1, where a multiplicative product with nonlinearity is applied as the attention function: f(u, v) = relu(W_u u)^T relu(W_v v). The question attention for the j-th context word is then softmax(E_{:j}), which is used to compute an attended question vector ṽ_j = V · softmax(E_{:j}).

To efficiently fuse the attentive information into the context, a heuristic fusion function, denoted as o = fusion(x, y), is proposed as

  x̃ = relu( W_r [x; y; x∘y; x−y] )
  g = σ( W_g [x; y; x∘y; x−y] )
  o = g ∘ x̃ + (1 − g) ∘ x    (13)

where σ denotes the sigmoid activation function, ∘ denotes element-wise multiplication, and the bias term is omitted. The computation is similar to that of highway networks [Srivastava et al., 2015], where the output vector o is a linear interpolation of the input x and the intermediate vector x̃. A gate g is used to control the degree to which the intermediate vector is exposed. With this function, the question-aware context vectors H = [h_1, ..., h_m] can be obtained as h_j = fusion(u_j, ṽ_j).

Similar to the above computation, a self alignment is applied to capture the long-term dependencies among context words. Again, we compute a similarity matrix B ∈ R^{m×m} using Eq. 2. The attended context vector is then computed as h̃_j = H · softmax(B_{:j}), where softmax(B_{:j}) is the self attention for the j-th context word. Using the same fusion function as z_j = fusion(h_j, h̃_j), we can obtain self-aware context vectors Z = [z_1, ..., z_m].

Finally, a BiLSTM is used to perform the evidence collection, which outputs the fully-aware context vectors R = [r_1, ..., r_m] with Z as its inputs.

Multi-round Alignments with Reattention. To enhance the ability to capture complex interactions among the inputs, we stack two more aligning blocks with the reattention mechanism as follows:

  R^1, Z^1, E^1, B^1 = align¹(U, V)
  R^2, Z^2, E^2, B^2 = align²(R^1, V, E^1, B^1)
  R^3, Z^3, E^3, B^3 = align³(R^2, V, E^2, B^2, Z^1, Z^2)    (14)

where align^t denotes the t-th block. In the t-th block (t > 1), we fix the hidden representation of the question as V, and set the hidden representation of the context to the previous fully-aware context vectors R^{t−1}. Then we compute the unnormalized reattentions E^t and B^t with Eq. 4 and Eq. 5 respectively. In addition, we utilize a residual connection [He et al., 2016] in the last BiLSTM to form the final fully-aware context vectors R^3 = [r^3_1, ..., r^3_m]: r^3_j = BiLSTM([z^1_j; z^2_j; z^3_j]).

Answer Pointer. We apply a variant of pointer networks [Vinyals et al., 2015] as the answer pointer to make the predictions. First, the question representation V is summarized into a fixed-size summary vector s as s = Σ_{i=1}^n α_i v_i, where α_i ∝ exp(w^T v_i). Then we compute the start probability p_1(i) by heuristically attending the context representation R^3 with the question summary s:

  p_1(i) ∝ exp( w_1^T tanh(W_1 [r^3_i; s; r^3_i ∘ s; r^3_i − s]) )    (15)

Next, a new question summary s̃ is updated by fusing the context information of the start position, computed as l = R^3 · p_1, into the old question summary: s̃ = fusion(s, l). Finally, the end probability p_2(j|i) is computed as

  p_2(j|i) ∝ exp( w_2^T tanh(W_2 [r^3_j; s̃; r^3_j ∘ s̃; r^3_j − s̃]) )    (16)

5 Experiments

5.1 Implementation Details

We mainly focus on the SQuAD dataset [Rajpurkar et al., 2016] to train and evaluate our model. SQuAD is a machine comprehension dataset containing more than 100,000 questions manually annotated by crowdsourcing workers on a set of 536 Wikipedia articles. In addition, we also test our model on two adversarial SQuAD datasets [Jia and Liang, 2017], namely AddSent and AddOneSent. In both adversarial datasets, a confusing sentence with a wrong answer is appended to the end of the context in order to fool the model.

We evaluate the Reinforced Mnemonic Reader (R.M-Reader) with the following setting. We first train the model until convergence by optimizing Eq. 7. We then fine-tune this model with Eq. 11 until the F1 score on the development set no longer improves.

We use the Adam optimizer [Kingma and Ba, 2014] for both ML and DCRL training. The initial learning rates are 0.0008 and 0.0001 respectively, and are halved whenever a bad iteration is met. The batch size is 48 and a dropout rate [Srivastava et al., 2014] of 0.3 is used to prevent overfitting. Word embeddings remain fixed during training. For out-of-vocabulary words, we initialize the embeddings from Gaussian distributions and keep them trainable. The size of the character embedding and corresponding LSTMs is 50, the main hidden size is 100, and the hyperparameter γ is 3.
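The γ above is the initial value of the trainable reattention weight of Eqs. 4-5, applied between aligning blocks in Eq. 14. A minimal numpy sketch of the refinement of Eq. 4, with toy shapes and random similarities purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, m = 3, 5                        # toy question/context lengths
rng = np.random.default_rng(1)
E_prev = rng.normal(size=(n, m))   # E^{t-1}: past question-context similarities
B_prev = rng.normal(size=(m, m))   # B^{t-1}: past context self similarities
E_cur = rng.normal(size=(n, m))    # f(v_i^t, u_j^t) for the current round
gamma = 3.0                        # initialized to 3 as in Sec. 5.1, then trained

# Eq. 4: E~^t_{ij} = softmax(E^{t-1}_{i:}) . softmax(B^{t-1}_{:j})
P = softmax(E_prev, axis=1)        # row i: past context attention of question word i
Q = softmax(B_prev, axis=0)        # column j: past self attention of context word j
E_tilde = P @ Q                    # pairwise dot products of attention distributions
E_t = E_cur + gamma * E_tilde      # refined similarity matrix, normalized as usual

# Each entry of E_tilde lies in [0, 1]: 0 for disjoint attentions,
# 1 when both distributions are identical one-hot vectors.
```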
Single Model            Dev EM   Dev F1   Test EM   Test F1
LR Baseline1            40.0     51.0     40.4      51.0
DCN+2                   74.5     83.1     75.1      83.1
FusionNet3              75.3     83.6     76.0      83.9
SAN4                    76.2     84.1     76.8      84.4
AttentionReader+†       -        -        77.3      84.9
BSE5                    77.9     85.6     78.6      85.8
R-net+†                 -        -        79.9      86.5
SLQA+†                  -        -        80.4      87.0
Hybrid AoA Reader+†     -        -        80.0      87.3
R.M-Reader              78.9     86.3     79.5      86.6

Ensemble Model
DCN+2                   -        -        78.8      86.0
FusionNet3              78.5     85.8     79.0      86.0
SAN4                    78.6     85.8     79.6      86.5
BSE5                    79.6     86.6     81.0      87.4
AttentionReader+†       -        -        81.8      88.2
R-net+†                 -        -        82.6      88.5
SLQA+†                  -        -        82.4      88.6
Hybrid AoA Reader+†     -        -        82.5      89.3
R.M-Reader              81.2     87.9     82.3      88.5
Human1                  80.3     90.5     82.3      91.2

Table 2: The performance of the Reinforced Mnemonic Reader and other competing approaches on the SQuAD dataset. The results on the test set were extracted on Feb 2, 2018: Rajpurkar et al. [2016]1, Xiong et al. [2017a]2, Huang et al. [2017]3, Liu et al. [2017b]4 and Peters [2018]5. † indicates unpublished works. BSE refers to BiDAF + Self Attention + ELMo.

Model           AddSent EM   AddSent F1   AddOneSent EM   AddOneSent F1
LR Baseline     17.0         23.2         22.3            41.8
Match-LSTM1∗    24.3         34.2         34.8            41.8
BiDAF2∗         29.6         34.2         40.7            46.9
SEDT3∗          30.0         35.0         40.0            46.5
ReasoNet4∗      34.6         39.4         43.6            49.8
FusionNet5∗     46.2         51.4         54.7            60.7
R.M-Reader      53.0         58.5         60.9            67.0

Table 3: Performance comparison on two adversarial SQuAD datasets: Wang & Jiang [2017]1, Seo et al. [2017]2, Liu et al. [2017a]3, Shen et al. [2016]4 and Huang et al. [2017]5. ∗ indicates ensemble models.

Configuration               EM     F1     ∆EM    ∆F1
R.M-Reader                  78.9   86.3   −      −
(1) − Reattention           78.1   85.8   -0.8   -0.5
(2) − DCRL                  78.2   85.4   -0.7   -0.9
(3) − Reattention, DCRL     77.1   84.8   -1.8   -1.5
(4) − DCRL, + SCST          78.5   85.8   -0.4   -0.5
(5) Attention: Dot          78.2   85.9   -0.7   -0.4
(6) − Heuristic Sub         78.1   85.7   -0.8   -0.6
(7) − Heuristic Mul         78.3   86.0   -0.6   -0.3
(8) Fusion: Gate            77.9   85.6   -1.0   -0.7
(9) Fusion: MLP             77.2   85.2   -1.7   -1.1
(10) Num of Blocks: 2       78.7   86.1   -0.2   -0.2
(11) Num of Blocks: 4       78.8   86.3   -0.1   0
(12) Num of Blocks: 5       77.5   85.2   -1.4   -1.1

Table 4: Ablation study on the SQuAD dev set.

5.2 Overall Results

We submitted our model to the hidden test set of SQuAD for evaluation. Two evaluation metrics are used: Exact Match (EM), which measures whether the predicted answer exactly matches the ground truth, and F1 score, which measures the degree of word overlap at the token level.

As shown in Table 2, R.M-Reader achieves an EM score of 79.5% and an F1 score of 86.6%. Since SQuAD is a competitive MRC benchmark, we also build an ensemble model that consists of 12 single models with the same architecture but initialized with different parameters. Our ensemble model improves the metrics to 82.3% and 88.5% respectively².

Table 3 shows the performance comparison on the two adversarial datasets, AddSent and AddOneSent. All models are trained on the original training set of SQuAD, and are tested on the two datasets. As we can see, R.M-Reader comfortably outperforms all previous models by more than 6% in both EM and F1 scores, indicating that our model is more robust against adversarial attacks.

5.3 Ablation Study

The contributions of each component of our model are shown in Table 4. Firstly, ablations (1-4) explore the utility of the reattention mechanism and the DCRL training method. We notice that reattention has more influence on the EM score while DCRL contributes more to the F1 metric, and removing both of them results in huge drops on both metrics. Replacing DCRL with SCST also causes a marginal decline of performance on both metrics. Next, we replace the default attention function with the dot product f(u, v) = u · v (5), and both metrics suffer from degradations. (6-7) show the effectiveness of the heuristics used in the fusion function: removing either of the two heuristics leads to some performance decline, and the heuristic subtraction is more effective than the multiplication. Ablations (8-9) further explore different forms of fusion, where "gate" refers to o = g ∘ x̃ and "MLP" denotes o = x̃ in Eq. 13, respectively. In both cases the highway-like function outperforms its simpler variants. Finally, we study the effect of different numbers of aligning blocks in (10-12). We notice that using 2 blocks causes a slight performance drop, while increasing to 4 blocks barely affects the state-of-the-art result. Interestingly, a very deep alignment with 5 blocks results in a significant performance decline. We argue that this is because the model encounters the degradation problem that exists in deep networks [He et al., 2016].
5.4 Effectiveness of Reattention
2 We further present experiments to demonstrate the effec-
The results are on [Link]
0xe6c23cbae5e440b8942f86641f49fd80. tiveness of reattention mechanism. For the attention redun-
KL divergence - Reattention + Reattention Context: Carolina's secondary featured Pro Bowl safety Kurt Coleman, who
Redundancy led the team with a career high seven interceptions, while also racking up 88
tackles and Pro Bowl cornerback Josh Norman, who developed into a
E 1 to E 2 0.695 ± 0.086 0.866 ± 0.074 shutdown corner during the season and had four interceptions, two of which
E 2 to E 3 0.404 ± 0.067 0.450 ± 0.052 were returned for touchdowns.
B 1 to B 2 0.976 ± 0.092 1.207 ± 0.121 Question: How many interceptions did Josh Norman score touchdowns with
B 2 to B 3 1.179 ± 0.118 1.193 ± 0.097 in 2015?
Deficiency Answer: two

E 2 to E 2 0.650 ± 0.044 0.568 ± 0.059 Context: The further decline of Byzantine state-of-affairs paved the road to a
3 3∗
E to E 0.536 ± 0.047 0.482 ± 0.035 third attack in 1185, when a large Norman army invaded Dyrrachium, owing to
the betrayal of high Byzantine officials. Some time later, Dyrrachium—one of
the most important naval bases of the Adriatic—fell again to Byzantine hands.
Table 5: Comparison of KL diverfence on different attention distri-
Question: Where was Dyrrachium located?
butions on SQuAD dev set. Answer: the Adriatic

Context: The motor used polyphase current which generated a rotating magnetic
dancy problem, we measure the distance of attention distri- field to turn the motor (a principle Tesla claimed to have conceived in 1882). This
innovative electric motor, patented in May 1888, was a simple self-starting design
butions in two adjacent aligning blocks, e.g., softmax(E:j1 ) that did not need a commutator, thus avoiding sparking and the high
and softmax(E:j2 ). Higher distance means less attention re- maintenance of constantly servicing and replacing mechanical brushes.
dundancy. For the attention deficiency problem, we take the Question: What high maintenance part did Tesla's AC motor not require?
arithmetic mean of multiple attention distributions from the Answer: mechanical brushes
ensemble model as the “ground truth” attention distribution

softmax(E:jt ), and compute the distance of individual at-
Figure 5: Predictions with DCRL (red) and with SCST (blue) on
tention softmax(E:jt ) with it. Lower distance refers to less SQuAD dev set.
attention deficiency. We use Kullback–Leibler divergence as
the distance function D, and we report the averaged value
over all examples. 6 Conclusion
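The redundancy and deficiency diagnostics of Section 5.4 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names (`redundancy_distance`, `deficiency_distance`) and the score arrays passed to them are hypothetical, and the ensemble "ground truth" is simply the arithmetic mean of the member models' attention distributions, as the text describes.

```python
# Illustrative sketch (not the authors' code) of the Section 5.4 diagnostics.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) computed row-wise over the last axis.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def redundancy_distance(scores_a, scores_b):
    # Mean KL divergence between the attention distributions of two
    # adjacent aligning blocks; a HIGHER value means LESS redundancy.
    return kl_divergence(softmax(scores_a), softmax(scores_b)).mean()

def deficiency_distance(ensemble_scores, single_scores):
    # Mean KL divergence between the ensemble-average ("ground truth")
    # attention and one model's attention; LOWER means LESS deficiency.
    ground_truth = softmax(ensemble_scores).mean(axis=0)
    return kl_divergence(ground_truth, softmax(single_scores)).mean()
```

With the raw similarity matrices of two adjacent blocks, `redundancy_distance` would correspond to an `E^1 to E^2`-style entry of Table 5, averaged over all dev-set examples; `deficiency_distance` likewise corresponds to the `E^t to E^t*` entries.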
Table 5 shows the results. We first see that reattention indeed helps in alleviating the attention redundancy: the divergence between any two adjacent blocks is successfully enlarged with reattention. However, we find that the improvement between the first two blocks is larger than that between the last two blocks. We conjecture that the first reattention is more accurate at measuring the similarity of word pairs, since it uses the original encoded word representations, while the latter reattention is distracted by highly nonlinear word representations. In addition, we notice that the attention deficiency has also been moderated: the divergence between the normalized E^t* and E^t is reduced.

5.5 Prediction Analysis
Figure 5 compares predictions made either with dynamic-critical reinforcement learning or with self-critical sequence training. We first find that both approaches are able to obtain answers that match the query-sensitive category. For example, the first example shows that both four and two are retrieved when the question asks for how many. Nevertheless, we observe that DCRL consistently makes more accurate predictions on answer spans, especially when SCST already points to a rough boundary. In the second example, SCST takes the whole phrase after Dyrrachium as its location. The third example shows a similar phenomenon, where SCST retrieves the phrase constantly servicing and replacing mechanical brushes as its answer. We demonstrate that this is because SCST encounters the convergence suppression problem, which impedes the prediction of the ground-truth answer boundaries. DCRL, however, successfully avoids this problem and thus finds the exactly correct entity.

6 Conclusion
We propose the Reinforced Mnemonic Reader, an enhanced attention reader with two main contributions. First, a reattention mechanism is introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures. Second, a dynamic-critical reinforcement learning approach is presented to address the convergence suppression problem that exists in traditional reinforcement learning methods. Our model achieves state-of-the-art results on the SQuAD dataset, outperforming several strong competing systems. Besides, our model outperforms existing approaches by more than 6% on two adversarial SQuAD datasets. We believe that both reattention and DCRL are general approaches and can be applied to other NLP tasks such as natural language inference. Our future work is to study the compatibility of our proposed methods.

Acknowledgments
This research work is supported by the National Basic Research Program of China under Grant No. 2014CB340303. In addition, we thank Pranav Rajpurkar for help with SQuAD submissions.

References
[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
[Chen et al., 2017] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.
[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016.
[Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NIPS, 2015.
[Hill et al., 2016] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of ICLR, 2016.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Huang et al., 2017] Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. FusionNet: Fusing via fully-aware attention with application to machine comprehension. arXiv preprint arXiv:1711.07341, 2017.
[Jia and Liang, 2017] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP, 2017.
[Kendall et al., 2017] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 2017.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[Liu et al., 2017a] Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for machine comprehension. arXiv preprint arXiv:1703.00572, 2017.
[Liu et al., 2017b] Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556, 2017.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
[Peters et al., 2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL, 2018.
[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.
[Rennie et al., 2016] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
[Seo et al., 2017] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017.
[Shen et al., 2016] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284, 2016.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958, 2014.
[Srivastava et al., 2015] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
[Vinyals et al., 2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Proceedings of NIPS, 2015.
[Wang and Jiang, 2017] Shuohang Wang and Jing Jiang. Machine comprehension using match-LSTM and answer pointer. In Proceedings of ICLR, 2017.
[Wang et al., 2017] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
[Weissenborn et al., 2017] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. In Proceedings of CoNLL, pages 271–280, 2017.
[Xiong et al., 2017a] Caiming Xiong, Victor Zhong, and Richard Socher. DCN+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106, 2017.
[Xiong et al., 2017b] Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. In Proceedings of ICLR, 2017.
