Reinforced Mnemonic Reader for MRC
Minghao Hu†∗, Yuxing Peng†, Zhen Huang†, Xipeng Qiu‡, Furu Wei§, Ming Zhou§
†College of Computer, National University of Defense Technology, Changsha, China
‡School of Computer Science, Fudan University, Shanghai, China
§Microsoft Research, Beijing, China
{huminghao09,pengyuhang,huangzhen}@[Link]
xpqiu@[Link], {fuwei,mingzhou}@[Link]
namely random inference and greedy inference. The result with the higher score is always set to be the reward, while the other is the baseline. In this way, the normalized reward is ensured to be always positive, so that no convergence suppression occurs.

All of the above innovations are integrated into a new end-to-end neural architecture called Reinforced Mnemonic Reader, shown in Figure 3. We conducted extensive experiments on both the SQuAD [Rajpurkar et al., 2016] dataset and two adversarial SQuAD datasets [Jia and Liang, 2017] to evaluate the proposed model. On SQuAD, our single model obtains an exact match (EM) score of 79.5% and an F1 score of 86.6%, while our ensemble model further boosts the results to 82.3% and 88.5% respectively. On adversarial SQuAD, our model surpasses existing approaches by more than 6% on both the AddSent and AddOneSent datasets.

2 MRC with Reattention

2.1 Task Description
In MRC tasks, a question Q and a context C are given, and our goal is to predict an answer A, whose form depends on the specific task. In the SQuAD dataset [Rajpurkar et al., 2016], the answer A is constrained to be a segment of text in the context C, and neural networks are designed to model the probability distribution p(A|C, Q).

2.2 Alignment Architecture for MRC
Among all state-of-the-art works for MRC, one of the key factors is the alignment architecture. That is, given the hidden representations of question and context, we align each context word with the entire question using attention mechanisms, and enhance the context representation with the attentive question information. A detailed comparison of different alignment architectures is shown in Table 1.

Early work on MRC, such as Match-LSTM [Wang and Jiang, 2017], utilizes the attention mechanism stemming from neural machine translation [Dzmitry Bahdanau, 2015] serially, where the attention is computed inside the cell of a recurrent neural network. A more popular approach is to compute the attentions in parallel, resulting in a similarity matrix. Concretely, given two sets of hidden vectors, V = {v_i}_{i=1}^n and U = {u_j}_{j=1}^m, representing question and context respectively, a similarity matrix E ∈ R^{n×m} is computed as

    E_ij = f(v_i, u_j)    (1)

where E_ij indicates the similarity between the i-th question word and the j-th context word, and f is a scalar function. Different methods have been proposed to normalize the matrix, resulting in variants of attention such as bi-attention [Seo et al., 2017] and coattention [Xiong et al., 2017b]. The attention is then used to attend the question and form a question-aware context representation H = {h_j}_{j=1}^m.

Later, Wang et al. [2017] propose a serial self-aligning method that aligns the context against itself to capture long-term dependencies among context words. Weissenborn et al. [2017] apply the self alignment in a way similar to Eq. 1, yielding another similarity matrix B ∈ R^{m×m} as

    B_ij = 1{i≠j} f(h_i, h_j)    (2)

where 1{·} is an indicator function ensuring that a context word is not aligned with itself. Finally, the attentive information can be integrated to form a self-aware context representation Z = {z_j}_{j=1}^m, which is used to predict the answer.

We refer to the above process as a single-round alignment architecture. Such an architecture, however, is limited in its capability to capture complex interactions between question and context. Therefore, recent works build multi-round alignment architectures by stacking several identical aligning layers [Huang et al., 2017; Xiong et al., 2017a]. More specifically, let V^t = {v_i^t}_{i=1}^n and U^t = {u_j^t}_{j=1}^m denote the hidden representations of question and context in the t-th layer, and let H^t = {h_j^t}_{j=1}^m be the corresponding question-aware context representation. Then the two similarity matrices can be computed as

    E_ij^t = f(v_i^t, u_j^t),    B_ij^t = 1{i≠j} f(h_i^t, h_j^t)    (3)

However, one problem is that, in such an architecture, each alignment is not directly aware of previous alignments. The attentive information can only flow to the subsequent layer through the hidden representation. This can cause two problems: 1) attention redundancy, where multiple attention distributions are highly similar. Let softmax(x) denote the softmax function over a vector x. Then this problem can be formalized as D(softmax(E_{:j}^t) ‖ softmax(E_{:j}^k)) < σ (t ≠ k), where σ is a small bound and D is a function measuring the distribution distance. 2) Attention deficiency, which means that the attention fails to focus on salient parts of the input: D(softmax(E_{:j}^{t*}) ‖ softmax(E_{:j}^t)) > δ, where δ is another bound and softmax(E_{:j}^{t*}) is the “ground truth” attention distribution.
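To make the single-round alignment concrete, here is a minimal numpy sketch. It assumes a plain dot product for the unspecified scalar function f and a simple weighted sum for forming the question-aware representation; variants such as bi-attention and coattention normalize E differently.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_round_alignment(V, U):
    """One round of question-to-context alignment (Eq. 1).

    V: (n, d) question hidden states, U: (m, d) context hidden states.
    Returns the similarity matrix E and the question-aware
    context representation H of shape (m, d).
    """
    E = V @ U.T             # similarity matrix, shape (n, m)
    A = softmax(E, axis=0)  # attend over question words, per context word
    H = A.T @ V             # h_j = sum_i A[i, j] * v_i
    return E, H

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))   # n = 5 question words
U = rng.normal(size=(7, 8))   # m = 7 context words
E, H = single_round_alignment(V, U)
assert E.shape == (5, 7) and H.shape == (7, 8)
```

In the full model, H is typically formed by fusing the attended question vectors with U rather than by this bare weighted sum; the sketch only shows the similarity-then-attend pattern shared by all the variants above.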
2.3 Reattention Mechanism
To address these problems, we propose to temporally memorize past attentions and explicitly use them to refine current attentions. The intuition is that two words should be correlated if their attentions over the same texts are highly overlapped, and less related vice versa. For example, in Figure 2, suppose that we have access to previous attentions; we can then compute their dot product to obtain a “similarity of attention”. In this case, the similarity of the word pair (team, Broncos) is higher than that of (team, Panthers).

Therefore, we define the computation of reattention as follows. Let E^{t−1} and B^{t−1} denote the past similarity matrices that are temporally memorized. The refined similarity matrix E^t (t > 1) is computed as

    Ẽ_ij^t = softmax(E_{i:}^{t−1}) · softmax(B_{:j}^{t−1})
    E_ij^t = f(v_i^t, u_j^t) + γ Ẽ_ij^t    (4)

Figure 2: Illustrations of reattention for the example in Figure 1.
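Eq. 4 can be sketched in a few lines of numpy. As before, the raw similarities f(v_i^t, u_j^t) are assumed given; `gamma` is fixed here for illustration, though in the model it is a trainable scalar.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reattention(E_prev, B_prev, E_raw, gamma=0.5):
    """Refine the current similarity matrix with past attentions (Eq. 4).

    E_prev: (n, m) past question-context similarity.
    B_prev: (m, m) past self similarity.
    E_raw:  (n, m) current raw similarities f(v_i, u_j).
    """
    ctx_att = softmax(E_prev, axis=1)   # past context attention per question word i
    self_att = softmax(B_prev, axis=0)  # past self attention per context word j
    # E_tilde[i, j] = ctx_att[i, :] . self_att[:, j], a dot of two distributions
    E_tilde = ctx_att @ self_att
    return E_raw + gamma * E_tilde

rng = np.random.default_rng(1)
E_new = reattention(rng.normal(size=(5, 7)),
                    rng.normal(size=(7, 7)),
                    np.zeros((5, 7)), gamma=1.0)
# each dot product of two probability vectors lies in [0, 1]
assert np.all(E_new >= 0.0) and np.all(E_new <= 1.0)
```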
where γ is a trainable parameter. Here, softmax(E_{i:}^{t−1}) is the past context attention distribution for the i-th question word, and softmax(B_{:j}^{t−1}) is the self attention distribution for the j-th context word. In the extreme case, when there is no overlap between the two distributions, the dot product is 0. On the other hand, if the two distributions are identical and focused on a single word, it reaches its maximum value of 1. Therefore, the similarity of two words can be explicitly measured using their past attentions. Since the dot product is relatively smaller than the original similarity, we initialize γ with a tunable hyper-parameter and keep it trainable. The refined similarity matrix can then be normalized for attending the question. Similarly, we can compute the refined matrix B^t to get the unnormalized self reattention as

    B̃_ij^t = softmax(B_{i:}^{t−1}) · softmax(B_{:j}^{t−1})
    B_ij^t = 1{i≠j} f(h_i^t, h_j^t) + γ B̃_ij^t    (5)

where y_k^1 and y_k^2 are the answer span for the k-th example, and we denote p_1(i|C, Q; θ) and p_2(j|i, C, Q; θ) as p_1(i) and p_2(j|i) respectively for abbreviation.

Recently, reinforcement learning (RL), with the task reward measured as word overlap between the predicted answer and the ground truth, has been introduced to MRC [Xiong et al., 2017a]. A baseline b, obtained by running greedy inference with the current model, is used to normalize the reward and reduce variance. This approach is known as self-critical sequence training (SCST) [Rennie et al., 2016], which was first used in image captioning. More specifically, let R(A^s, A*) denote the F1 score between a sampled answer A^s and the ground truth A*. The training objective is to minimize the negative expected reward by

    L_SCST(θ) = −E_{A^s∼p_θ(A)}[R(A^s) − R(Â)]    (8)
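The self-critical reward in Eq. 8 amounts to a token-overlap F1 minus a greedy baseline. A minimal sketch, with tokenization and answer extraction omitted:

```python
from collections import Counter

def f1_score(pred_tokens, gold_tokens):
    """Token-level F1, the reward R(A_s, A*) used for SCST/DCRL."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def scst_advantage(sampled, greedy, gold):
    """Self-critical advantage R(A_s) - R(Â): greedy answer as baseline."""
    return f1_score(sampled, gold) - f1_score(greedy, gold)

gold = ["mechanical", "brushes"]
adv = scst_advantage(["mechanical", "brushes"],
                     ["replacing", "mechanical", "brushes"], gold)
assert abs(adv - 0.2) < 1e-9  # sampled F1 = 1.0, greedy F1 = 0.8
```

A positive advantage increases the probability of the sampled answer; when the sample scores below the greedy baseline, the gradient pushes probability mass away from it.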
both A^s and Â. The one with the higher score is set as the reward, while the other is the baseline. We call this approach dynamic-critical reinforcement learning (DCRL). When the score of random inference is higher than the greedy one, DCRL is equivalent to SCST. Thus, Eq. 9 is a special case of Eq. 10.

¹In practice we found that a better approximation can be made by considering a top-K answer list, where Â is the best result and A^s is sampled from the rest of the list.

Following [Xiong et al., 2017a] and [Kendall et al., 2017], we combine the ML and DCRL objectives using homoscedastic uncertainty as task-dependent weightings, so as to stabilize the RL training, as

    L = 1/(2σ_a²) L_ML + 1/(2σ_b²) L_DCRL + log σ_a² + log σ_b²    (11)

where σ_a and σ_b are trainable parameters.

Figure 3: The architecture overview of Reinforced Mnemonic Reader. The subfigures to the right show detailed demonstrations of the reattention mechanism: 1) refined E^t to attend the query; 2) refined B^t to attend the context.

4 End-to-end Architecture
Based on the previous innovations, we introduce an end-to-end architecture called Reinforced Mnemonic Reader, shown in Figure 3. It consists of three main components: 1) an encoder that builds contextual representations for question and context jointly; 2) an iterative aligner that performs multi-round alignments between question and context with the reattention mechanism; 3) an answer pointer that predicts the answer span sequentially. Below we give more details of each component.

Encoder. Let W^Q = {w_i^q}_{i=1}^n and W^C = {w_j^c}_{j=1}^m denote the word sequences of the question and context respectively. The encoder first converts each word to an input vector. We utilize the 100-dim GloVe embedding [Pennington et al., 2014] and the 1024-dim ELMo embedding [Peters et al., 2018]. Besides, a character-level embedding is obtained by encoding the character sequence with a bi-directional long short-term memory network (BiLSTM) [Hochreiter and Schmidhuber, 1997], where the two last hidden states are concatenated to form the embedding. In addition, we use a binary feature of exact match, a POS embedding and an NER embedding for both question and context, as suggested in [Chen et al., 2017]. Together, the inputs X^Q = {x_i^q}_{i=1}^n and X^C = {x_j^c}_{j=1}^m are obtained.

To model each word with its contextual information, a weight-shared BiLSTM is utilized to perform the encoding:

    v_i = BiLSTM(x_i^q),    u_j = BiLSTM(x_j^c)    (12)

Thus, the contextual representations for both question and context words can be obtained, denoted as two matrices: V = [v_1, ..., v_n] ∈ R^{2d×n} and U = [u_1, ..., u_m] ∈ R^{2d×m}.

Iterative Aligner. The iterative aligner contains a stack of three aligning blocks. Each block consists of three modules: 1) an interactive alignment to attend the question into the context; 2) a self alignment to attend the context against itself; 3) an evidence collection to model the context representation with a BiLSTM. The reattention mechanism is utilized between two blocks, where past attentions are temporally memorized to help modulate current attentions.
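The uncertainty-weighted objective in Eq. 11 reduces to a few lines. A sketch, parameterizing s = log σ² for numerical stability (an implementation choice not stated in the text):

```python
import math

def combined_loss(l_ml, l_dcrl, log_var_a, log_var_b):
    """Homoscedastic-uncertainty weighting of the ML and DCRL losses (Eq. 11).

    With s = log(sigma^2), the weight 1/(2 sigma^2) becomes 0.5 * exp(-s),
    and the regularizer log(sigma^2) is simply s.
    """
    return (0.5 * math.exp(-log_var_a) * l_ml
            + 0.5 * math.exp(-log_var_b) * l_dcrl
            + log_var_a + log_var_b)

# with both log-variances at 0, each loss is weighted by 0.5
assert combined_loss(2.0, 4.0, 0.0, 0.0) == 3.0
```

As a task's learned variance grows, its loss term is down-weighted while the log-variance penalty discourages ignoring the task entirely, which is what stabilizes joint ML + RL training here.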
[Figure: a single aligning block, showing the interactive alignment, self alignment (each followed by fusion), and evidence collection (BiLSTM) modules.]

    R^1, Z^1, E^1, B^1 = align^1(U, V)
    R^2, Z^2, E^2, B^2 = align^2(R^1, V, E^1, B^1)
    R^3, Z^3, E^3, B^3 = align^3(R^2, V, E^2, B^2, Z^1, Z^2)    (14)

where align^t denotes the t-th block. In the t-th block (t > 1), we fix the hidden representation of the question as V, and set the hidden representation of the context as the previous fully-aware context representation R^{t−1}.
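The data flow of Eq. 14 can be sketched with a toy block. This is a simplification under the earlier dot-product assumption: fusion and the BiLSTM evidence collection are replaced with identity, and the extra fusion of Z^1 and Z^2 in the third block is omitted, so the sketch only illustrates how E, B, and the context representation are threaded between blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_block(R, V, E_prev=None, B_prev=None, gamma=0.5):
    """Toy aligning block: interactive alignment, self alignment,
    and an identity stand-in for evidence collection.
    R: (m, d) context representation, V: (n, d) question representation."""
    E = V @ R.T                                  # interactive similarity, (n, m)
    if E_prev is not None:                       # reattention, Eq. 4
        E = E + gamma * softmax(E_prev, axis=1) @ softmax(B_prev, axis=0)
    H = softmax(E, axis=0).T @ V                 # question-aware context, (m, d)
    B = H @ H.T                                  # self similarity, (m, m)
    np.fill_diagonal(B, -1e9)                    # enforce 1{i != j}
    if B_prev is not None:                       # self reattention, Eq. 5
        B = B + gamma * softmax(B_prev, axis=1) @ softmax(B_prev, axis=0)
    Z = softmax(B, axis=0).T @ H                 # self-aware context, (m, d)
    return Z, Z, E, B                            # (R_next, Z, E, B)

rng = np.random.default_rng(2)
V = rng.normal(size=(5, 8))                      # question
U = rng.normal(size=(7, 8))                      # context
R1, Z1, E1, B1 = align_block(U, V)
R2, Z2, E2, B2 = align_block(R1, V, E1, B1)      # second block reuses V, E1, B1
```

The key structural point survives the simplification: the question side V is fixed across blocks, while the context representation and the memorized similarity matrices are handed forward.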
Figure 5: Predictions with DCRL (red) and with SCST (blue) on the SQuAD dev set. One example — Context: The motor used polyphase current which generated a rotating magnetic field to turn the motor (a principle Tesla claimed to have conceived in 1882). This innovative electric motor, patented in May 1888, was a simple self-starting design that did not need a commutator, thus avoiding sparking and the high maintenance of constantly servicing and replacing mechanical brushes. Question: What high maintenance part did Tesla's AC motor not require? Answer: mechanical brushes.

For the attention redundancy problem, we measure the distance between the attention distributions in two adjacent aligning blocks, e.g., softmax(E_{:j}^1) and softmax(E_{:j}^2). A higher distance means less attention redundancy. For the attention deficiency problem, we take the arithmetic mean of multiple attention distributions from the ensemble model as the “ground truth” attention distribution softmax(E_{:j}^{t*}), and compute the distance of the individual attention softmax(E_{:j}^t) from it. A lower distance indicates less attention deficiency. We use the Kullback–Leibler divergence as the distance function D, and we report the averaged value
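The distance measurements above can be sketched directly, with D instantiated as the KL divergence; the attention vectors below are made up purely to illustrate the redundancy check.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) between two attention distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Redundancy check: distance between attentions of adjacent blocks.
# Nearly identical distributions -> small divergence -> redundant attention.
att_block1 = [0.70, 0.10, 0.10, 0.10]
att_block2 = [0.65, 0.15, 0.10, 0.10]
assert kl_divergence(att_block1, att_block2) < 0.05
```

For the deficiency measurement, the same function is applied with the ensemble-mean “ground truth” distribution as p and the individual attention as q, averaged over all examples.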
over all examples.

Table 5 shows the results. We first see that reattention indeed helps alleviate the attention redundancy: the divergence between any two adjacent blocks is successfully enlarged with reattention. However, we find that the improvement between the first two blocks is larger than that between the last two blocks. We conjecture that the first reattention is more accurate at measuring the similarity of word pairs, as it uses the original encoded word representations, while the latter reattention is distracted by highly nonlinear word representations. In addition, we notice that the attention deficiency has also been moderated: the divergence between the normalized E^t and E^{t*} is reduced.

5.5 Prediction Analysis
Figure 5 compares predictions made either with dynamic-critical reinforcement learning or with self-critical sequence training. We first find that both approaches are able to obtain answers that match the query-sensitive category. For example, the first example shows that both four and two are retrieved when the question asks how many. Nevertheless, we observe that DCRL consistently makes more accurate predictions on answer spans, especially when SCST already points to a rough boundary. In the second example, SCST takes the whole phrase after Dyrrachium as its location. The third example shows a similar phenomenon, where SCST retrieves the phrase constantly servicing and replacing mechanical brushes as its answer. We demonstrate that this is because SCST encounters the convergence suppression problem, which impedes the prediction of ground-truth answer boundaries. DCRL, however, successfully avoids this problem and thus finds the exactly correct entity.

6 Conclusion
We propose the Reinforced Mnemonic Reader, an enhanced attention reader with two main contributions. First, a reattention mechanism is introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures. Second, a dynamic-critical reinforcement learning approach is presented to address the convergence suppression problem that exists in traditional reinforcement learning methods. Our model achieves state-of-the-art results on the SQuAD dataset, outperforming several strong competing systems. Besides, our model outperforms existing approaches by more than 6% on two adversarial SQuAD datasets. We believe that both reattention and DCRL are general approaches and can be applied to other NLP tasks such as natural language inference. Our future work is to study the compatibility of our proposed methods.

Acknowledgments
This research work is supported by the National Basic Research Program of China under Grant No. 2014CB340303. In addition, we thank Pranav Rajpurkar for help with SQuAD submissions.

References
[Chen et al., 2017] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.
[Dzmitry Bahdanau, 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016.
[Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NIPS, 2015.
[Hill et al., 2016] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of ICLR, 2016.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Huang et al., 2017] Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. Fusionnet: Fusing via fully-aware attention with application to machine comprehension. arXiv preprint arXiv:1711.07341, 2017.
[Jia and Liang, 2017] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP, 2017.
[Kendall et al., 2017] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 2017.
[Kingma and Ba, 2014] Diederik P. Kingma and Lei Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[Liu et al., 2017a] Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for machine comprehension. arXiv preprint arXiv:1703.00572, 2017.
[Liu et al., 2017b] Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556, 2017.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Proceedings of EMNLP, 2014.
[Peters et al., 2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL, 2018.
[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.
[Rennie et al., 2016] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
[Seo et al., 2017] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017.
[Shen et al., 2016] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284, 2016.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958, 2014.
[Srivastava et al., 2015] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
[Vinyals et al., 2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Proceedings of NIPS, 2015.
[Wang and Jiang, 2017] Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. In Proceedings of ICLR, 2017.
[Wang et al., 2017] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
[Weissenborn et al., 2017] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural qa as simple as possible but not simpler. In Proceedings of CoNLL, pages 271–280, 2017.
[Xiong et al., 2017a] Caiming Xiong, Victor Zhong, and Richard Socher. Dcn+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106, 2017.
[Xiong et al., 2017b] Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. In Proceedings of ICLR, 2017.