Document Representation
5.1 Introduction
The one-hot document representation suffers from high dimensionality and has very little sense of the semantics of words or, more formally, of the distances between words. Hence, one approach to representing text documents uses multi-word terms as vector components, where the terms are noun phrases extracted with a combination of linguistic and statistical criteria. This representation is motivated by the idea, shared with topic models, that terms should carry more semantic information than individual words. Another advantage of using terms to represent a document is the lower dimensionality compared with the traditional one-hot document representation.
Nevertheless, applying these to generation tasks remains difficult. To understand
how discourse units are connected, one has to understand the communicative function
of each unit, and the role it plays within the context that encapsulates it, recursively
all the way up for the entire text. Identifying increasingly sophisticated human-
developed features may be insufficient for capturing these patterns, but developing
representation-based alternatives has also been difficult. Although document repre-
sentation can capture aspects of coherent sentence structure, it is not clear how it
could help in generating more broadly cohesive text.
Recently, neural network models have shown compelling results in generating
meaningful and grammatical documents in sequence generation tasks like machine
translation or parsing. This is partially attributed to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically
and syntactically to form meanings that they wish to express. Based on neural net-
work models, many research works have developed a variety of ways to incorporate
document-level contextual information. These models are all hybrid architectures in
that they are recurrent at the sentence level, but use a different structure to summarize
the context outside the sentence. Furthermore, some models explore multilevel recur-
rent architectures for combining local and global information in language modeling.
In this chapter, we first introduce the one-hot representation for documents. Next,
we extensively introduce topic models that aim to learn latent topic distributions of
words and documents. Further, we give an introduction on distributed document rep-
resentation including paragraph vector and neural document representations. Finally,
we introduce several typical real-world applications of document representations,
including information retrieval and question answering.
5.2 One-Hot Document Representation

In the simplest bag-of-words model, a document d is represented as the sum of the one-hot vectors of the words it contains:
\[
\mathbf{d} = \sum_{i=1}^{l} \mathbf{w}_i, \tag{5.1}
\]
where l is the length of the document d and \mathbf{w}_i is the one-hot representation of its ith word. Similar to the one-hot sentence representation, the TF-IDF method can also be applied to enhance the ability of the bag-of-words representation to reflect how important a word is to a document in a corpus.
In practice, the bag-of-words representation is mainly used as a tool for feature generation, and the most common features derived from it are the word frequencies in each document. The method is simple yet efficient and can sometimes reach excellent performance in real-world applications. However, the bag-of-words representation entirely ignores word order, which means different documents can have the same representation as long as they use the same words. Furthermore, it has little sense of the semantics of words or, more formally, of the distances between words, and therefore cannot exploit the rich information hidden in word representations.
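To make these representations concrete, the following sketch builds both a bag-of-words vector and a TF-IDF vector for a document. It is a minimal illustration in plain Python; the toy corpus, the whitespace tokenization, and the particular smoothed IDF variant are assumptions made only for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (whitespace tokenization assumed).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats are friends".split(),
]

vocab = sorted({w for doc in corpus for w in doc})
word2id = {w: i for i, w in enumerate(vocab)}

def bag_of_words(doc):
    """Sum of one-hot word vectors: d = sum_i w_i (term frequencies)."""
    vec = [0] * len(vocab)
    for w in doc:
        vec[word2id[w]] += 1
    return vec

def tf_idf(doc, corpus):
    """Re-weight term frequencies by inverse document frequency."""
    counts = Counter(doc)
    n_docs = len(corpus)
    vec = [0.0] * len(vocab)
    for w, tf in counts.items():
        df = sum(1 for d in corpus if w in d)        # document frequency of w
        idf = math.log(n_docs / (1 + df)) + 1.0      # one common smoothed IDF variant
        vec[word2id[w]] = tf * idf
    return vec

print(bag_of_words(corpus[0]))
print(["%.2f" % x for x in tf_idf(corpus[0], corpus)])
```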
5.3 Topic Model

Imagine browsing a large archive of news articles through the themes that run through it, such as a country's evolving relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.
But we do not interact with electronic archives in this way. While more and more
texts are available online, we do not have the human power to read and study them to
provide the kind of browsing experience described above. To this end, machine learn-
ing researchers have developed probabilistic topic modeling, a suite of algorithms
that aim to discover and annotate vast archives of documents with thematic informa-
tion. Topic modeling algorithms are statistical methods that analyze the words of the
original texts to explore the themes that run through them, how those themes are con-
nected, and how they change over time. Topic modeling algorithms do not require
any prior annotations or labeling of the documents. The topics emerge from the
analysis of the original texts. Topic modeling enables us to organize and summarize
electronic archives at a scale that would be impossible by human annotation.
A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words. Hofmann first introduced the probabilistic topic approach to document modeling in his Probabilistic Latent Semantic Indexing (pLSI) method. The pLSI model does not make any assumptions about how the mixture weights are generated, which makes it difficult to test the generalization ability of the model on new documents. Latent Dirichlet Allocation (LDA) extends this model by introducing a Dirichlet prior over the mixture weights. LDA is regarded as a simple but effective topic model, and we first describe its basic ideas [6].

5.3.1 Latent Dirichlet Allocation
The intuition behind LDA is that documents exhibit multiple topics. LDA is a
statistical model of document collections that tries to capture this intuition. It is most
easily described by its generative process, the imaginary random process by which
the model assumes the documents arose.
We formally define a topic to be a distribution over a fixed vocabulary. We assume that these topics are specified before any data have been generated. Now for each document in the collection, we generate the words in a two-stage process:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   a. Randomly choose a topic from the distribution over topics chosen in step #1.
   b. Randomly choose a word from the corresponding topic's distribution over the vocabulary.
This statistical model reflects the intuition that documents exhibit multiple topics.
Each document exhibits the topics with different proportions (step #1); each word in
each document is drawn from one of the topics (step #2b), where the selected topic
is chosen from the per-document distribution over topics (step #2a).
We emphasize that the algorithms have no information about these subjects and the
articles are not labeled with topics or keywords. The interpretable topic distributions
arise by computing the hidden structure that likely generated the observed collection
of documents.
LDA and other topic models are part of the broader field of probabilistic modeling.
In generative probabilistic modeling, we treat our data as arising from a generative
process that includes hidden variables. This generative process defines a joint prob-
ability distribution over both the observed and hidden random variables. Given the
observed variables, we perform data analysis by using that joint distribution to com-
pute the conditional distribution of the hidden variables. This conditional distribution
is also called the posterior distribution.
LDA falls precisely into this framework. The observed variables are the words of
the documents, the hidden variables are the topic structure, and the generative process
is as described above. The computational problem of inferring the hidden topic
structure from the documents is the problem of computing the posterior distribution,
the conditional distribution of the hidden variables given the documents.
We can describe LDA more formally with the following notation. The topics are
β1:K , where each βk is a distribution over the vocabulary. The topic proportions for the
dth document are θd , where θdk is the topic proportion for topic k in document d. The
topic assignments for the dth document are z d , where z d,n is the topic assignment
for the nth word in document d. Finally, the observed words for document d are
wd , where wd,n is the nth word in document d, which is an element from the fixed
vocabulary.
With this notation, the generative process for LDA corresponds to the following
joint distribution of the hidden and observed variables:
\[
P(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} P(\beta_i) \prod_{d=1}^{D} P(\theta_d) \left( \prod_{n=1}^{N} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right). \tag{5.2}
\]
Notice that this distribution specifies a number of dependencies. For example,
the topic assignment z d,n depends on the per-document topic proportions θd . As
another example, the observed word wd,n depends on the topic assignment z d,n and
all of the topics β1:K .
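Since the joint distribution in Eq. 5.2 mirrors the generative process directly, that process can be simulated in a few lines. The sketch below uses NumPy; the vocabulary size, topic number, and Dirichlet hyperparameters are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D, N = 8, 3, 4, 10          # vocabulary size, topics, documents, words per document
alpha, eta = 0.5, 0.1             # Dirichlet hyperparameters (illustrative values)

# Topics beta_{1:K}: each is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)

documents = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))   # step 1: per-document topic proportions
    words = []
    for n in range(N):
        z_dn = rng.choice(K, p=theta_d)          # step 2a: choose a topic assignment
        w_dn = rng.choice(V, p=beta[z_dn])       # step 2b: choose a word from that topic
        words.append(w_dn)
    documents.append(words)

print(documents)
```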
These dependencies define LDA. They are encoded in the statistical assumptions
behind the generative process, in the particular mathematical form of the joint distri-
bution, and in a third way, in the probabilistic graphical model for LDA. Probabilistic
graphical models provide a graphical language for describing families of probability
Fig. 5.1 The architecture of the graphical model for Latent Dirichlet Allocation (nodes α, θ_d, z_{d,n}, w_{d,n}, β_i, and η; plates of sizes N, D, and K)
distributions. The graphical model for LDA is shown in Fig. 5.1. Each node is a random variable and is labeled according to its role in the generative process. The hidden nodes (the topic proportions, assignments, and topics) are unshaded. The observed nodes (the words of the documents) are shaded. We use rectangles as plate notation to denote
replication. The N plate denotes the collection of words within documents; the D
plate denotes the collection of documents within the collection. These three repre-
sentations are equivalent ways of describing the probabilistic assumptions behind
LDA.
Recall that the key inferential problem is to compute the posterior distribution of the hidden variables given the documents:
\[
P(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{P(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{P(w_{1:D})}. \tag{5.3}
\]
The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure, but the number of such instantiations is exponentially large, so the sum is intractable in practice.
Topic modeling algorithms therefore approximate Eq. 5.3 by constructing an alternative distribution over the latent topic structure that is adapted to be close to the true posterior. Topic modeling algorithms generally fall into two categories:
sampling-based algorithms and variational algorithms.
Sampling-based algorithms attempt to collect samples from the posterior and approximate it with an empirical distribution. The most commonly used sampling algorithm for topic modeling is Gibbs sampling, where we construct a Markov chain, a sequence of random variables each dependent on the previous, whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm runs the chain for a long time, collects samples from the limiting distribution, and then approximates the posterior with the collected samples.
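A minimal collapsed Gibbs sampler for LDA is sketched below. It assumes symmetric Dirichlet priors and documents given as lists of word ids (as in the previous snippet), and it omits the practical refinements (burn-in schedules, sparsity tricks, hyperparameter optimization) that production samplers rely on.

```python
import numpy as np

def gibbs_lda(documents, V, K, alpha=0.5, eta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    C_dk = np.zeros((len(documents), K))   # topic counts per document
    C_kw = np.zeros((K, V))                # word counts per topic
    C_k = np.zeros(K)                      # total counts per topic
    z = []                                 # current topic assignment of every token

    for d, doc in enumerate(documents):    # random initialization
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            C_dk[d, k] += 1; C_kw[k, w] += 1; C_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(documents):
            for n, w in enumerate(doc):
                k_old = z[d][n]            # remove the token's current assignment
                C_dk[d, k_old] -= 1; C_kw[k_old, w] -= 1; C_k[k_old] -= 1
                # Conditional over topics given all other assignments.
                p = (C_dk[d] + alpha) * (C_kw[:, w] + eta) / (C_k + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][n] = k_new            # add the token back with its new topic
                C_dk[d, k_new] += 1; C_kw[k_new, w] += 1; C_k[k_new] += 1
    return C_dk, C_kw

# Example usage with the documents generated by the previous sketch:
# C_dk, C_kw = gibbs_lda(documents, V=8, K=3)
```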
5.3.2 Extensions
The simple LDA model provides a powerful tool for discovering and exploiting
the hidden thematic structure in large archives of text. However, one of the main
advantages of formulating LDA as a probabilistic model is that it can easily be
used as a module in more complicated models for more complex goals. Since its
introduction, LDA has been extended and adapted in many ways.
LDA is defined by the statistical assumptions it makes about the corpus. One active
area of topic modeling research is how to relax and extend these assumptions to
uncover a more sophisticated structure in the texts.
One assumption that LDA makes is the bag-of-words assumption that the order
of the words in the document does not matter. While this assumption is unrealistic, it
is reasonable if our only goal is to uncover the coarse semantic structure of the texts.
For more sophisticated goals, such as language generation, it is patently not appropriate. There have been many extensions to LDA that treat words as non-exchangeable.
For example, [59] develops a topic model that relaxes the bag-of-words assumption
by assuming that the topics generate words conditional on the previous word; [22]
develops a topic model that switches between LDA and a standard HMM. These mod-
els expand the parameter space significantly but show improved language modeling
performance.
Another assumption is that the order of documents does not matter. Again, this
can be seen by noticing that Eq. 5.3 remains invariant to permutations of the ordering
of documents in the collection. This assumption may be unrealistic when analyzing
long-running collections that span years or centuries. In such collections, we may
want to assume that the topics change over time. One approach to this problem is the
dynamic topic model [5], a model that respects the ordering of the documents and gives a richer posterior topic structure than LDA.
The third assumption about LDA is that the number of topics is assumed known
and fixed. The Bayesian nonparametric topic model provides an elegant solution: The
collection determines the number of topics during posterior inference, and new doc-
uments can exhibit previously unseen topics. Bayesian nonparametric topic models
have been extended to hierarchies of topics, which find a tree of topics, moving from
more general to more concrete, whose particular structure is inferred from the data
[4].
In many text analysis settings, the documents contain additional information such
as author, title, geographic location, links, and others that we might want to account
for when fitting a topic model. There has been a flurry of research on adapting topic
models to include meta-data.
The author-topic model [51] is an early success story for this kind of research. The
topic proportions are attached to authors; papers with multiple authors are assumed
to attach each word to an author, drawn from a topic drawn from his or her topic
proportions. The author-topic model allows for inferences about authors as well as
documents.
Many document collections are linked. For example, scientific papers are linked by
citations, or web pages are connected by hyperlinks. And several topic models have
been developed to account for those links when estimating the topics. The relational
topic model of [9] assumes that each document is modeled as in LDA and that the
links between documents depend on the distance between their topic proportions.
This is both a new topic model and a new network model. Unlike traditional statistical
models of networks, the relational topic model takes into account node attributes in
modeling the links.
Other work that incorporates meta-data into topic models includes models of
linguistic structure [8], models that account for distances between corpora [60], and
models of named entities [42]. General-purpose methods for incorporating meta-
data into topic models include Dirichlet-multinomial regression models [39] and
supervised topic models [37].
5.3.3 Acceleration
LDA inference can be accelerated with a Monte Carlo EM (MCEM) approach, which maximizes a lower bound of the log-likelihood of the model parameters Θ (document-topic distributions) and Φ (topic-word distributions):
\[
\log P(\Theta, \Phi \mid W, \alpha, \beta) \geq \mathbb{E}_{Q}\big[\log P(W, Z \mid \Theta, \Phi) - \log Q(Z)\big] + \log P(\Theta \mid \alpha) + \log P(\Phi \mid \beta) \triangleq \mathcal{J}(\Theta, \Phi, Q(Z)). \tag{5.4}
\]
The expectation can be approximated by Monte Carlo sampling:
\[
\mathbb{E}_{Q}\big[\log P(W, Z \mid \Theta, \Phi) - \log Q(Z)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \big[\log P(W, Z^{(s)} \mid \Theta, \Phi) - \log Q(Z^{(s)})\big], \tag{5.5}
\]
where Z^{(1)}, \ldots, Z^{(S)} \sim Q(Z) = P(Z \mid W, \Theta, \Phi). The sample size is set as S = 1, and Z is used as an abbreviation of Z^{(1)}.
Sampling Z: each dimension of Z can be sampled independently. Given the sampled topic assignments, the parameters are estimated from the assignment counts:
\[
\hat{\theta}_{dk} \propto C_{dk} + \alpha_k - 1, \qquad \hat{\phi}_{wk} = \frac{C_{wk} + \beta - 1}{C_{k} + \bar{\beta} - V}, \tag{5.8}
\]
where C_{dk} is the number of tokens in document d assigned to topic k, C_{wk} is the number of times word w is assigned to topic k, and C_k = \sum_{w} C_{wk}.
Instead of computing and storing Θ̂ and Φ̂, we compute and store the counts C_d and C_w to save memory, because the latter are sparse. Plugging Eq. 5.8 into Eq. 5.6 and letting α = α − 1, β = β − 1, we obtain the full MCEM algorithm, which iteratively performs the following two steps until a given number of iterations is reached:
\[
Q(z_{d,n} = k) \propto (C_{dk} + \alpha_k)\,\frac{C_{wk} + \beta_w}{C_{k} + \bar{\beta}}. \tag{5.9}
\]
The E-step samples each topic assignment z_{d,n} from Eq. 5.9, and the M-step updates the counts (and hence Θ̂ and Φ̂) accordingly.
5.4 Distributed Document Representation

As shown in Fig. 5.2, the paragraph vector model maps every paragraph to a unique vector, represented by a column in the paragraph matrix P, and maps every word to a unique vector, represented by a column in the word embedding matrix E. The paragraph vector and the word vectors are averaged or concatenated to predict the next word in a context. More formally, compared with the word vector framework, the only change in this model is that the hidden representation h is constructed from both E and P rather than from E alone.
Fig. 5.2 The PV-DM architecture: the paragraph vector (a column of the paragraph matrix, indexed by the paragraph id) and the word vectors of w_i, w_{i+1}, w_{i+2} are averaged or concatenated, and a classifier predicts the next word w_{i+3}
Given a sequence of training words w_1, w_2, \ldots, w_l, the objective of the paragraph vector model is to maximize the average log probability:
\[
\mathcal{O} = \frac{1}{l} \sum_{i=k}^{l-k} \log P(w_i \mid w_{i-k}, \ldots, w_{i+k}). \tag{5.11}
\]
The prediction task is typically performed with a multi-class classifier, such as softmax, so the probability is
\[
P(w_i \mid w_{i-k}, \ldots, w_{i+k}) = \frac{e^{y_{w_i}}}{\sum_{j} e^{y_j}}. \tag{5.12}
\]
The paragraph token can be thought of as another word. It acts as a memory that
remembers what is missing from the current context, or the topic of the paragraph. For
this reason, this model is often called the Distributed Memory Model of Paragraph
Vectors (PV-DM).
The above method considers the concatenation of the paragraph vector with the
word vectors to predict the next word in a text window. Another way is to ignore the
context words in the input, but force the model to predict words randomly sampled
from the paragraph in the output. In reality, what this means is that at each iteration
of stochastic gradient descent, we sample a text window, then sample a random word
from the text window and form a classification task given the Paragraph Vector. This
technique is shown in Fig. 5.3. This version is named the Distributed Bag-of-Words
version of Paragraph Vector (PV-DBOW), as opposed to the Distributed Memory version (PV-DM) described above.
Fig. 5.3 The PV-DBOW architecture: the paragraph vector (indexed by the paragraph id in the paragraph matrix) is used to predict words sampled from the paragraph
In addition to being conceptually simple, this model requires storing less data: only the softmax weights need to be stored, as opposed to both the softmax weights and the word vectors in the previous model. This model is also analogous to the Skip-gram model for word vectors.
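In practice, paragraph vectors are usually trained with an existing toolkit rather than implemented from scratch. The sketch below assumes the gensim library is available and uses its Doc2Vec implementation; the toy corpus, tags, and hyperparameters are illustrative only.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "neural networks learn document representations",
]
# Each training document needs a unique tag, which indexes its paragraph vector.
corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(raw_docs)]

# dm=1 selects the PV-DM architecture; dm=0 would select PV-DBOW.
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

# Paragraph vector of training document 0 (model.dv in gensim 4.x; older versions use
# model.docvecs) and an inferred vector for a previously unseen document.
vec_train = model.dv[0]
vec_new = model.infer_vector("a cat and a dog".split())
print(vec_train.shape, vec_new.shape)
```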
In this part, we introduce two main kinds of neural networks for document representation: the document-context language model and the hierarchical document autoencoder.
Recurrent architectures can be used to combine local and global information in document language modeling. The simplest such model would be to train a single RNN that ignores sentence boundaries; the last hidden state from the previous sentence t − 1 is used to initialize the first hidden state in sentence t. In such an architecture, the length of the RNN equals the number of tokens in the document; in typical genres such as news text, this means training RNNs on sequences of several hundred tokens, which introduces two problems. (1) Information decay: in a sentence with thirty tokens (not unusual in news text), the contextual information from the previous sentence must be propagated through the recurrent dynamics thirty times before it can reach the last token of the current sentence, and meaningful document-level information is unlikely to survive such a long pipeline. (2) Learning: it is notoriously difficult to train recurrent architectures that involve many time steps.
The context-to-context DCLM (ccDCLM) [28] takes the final hidden state of the previous sentence as a context vector, c_{t-1} = h_{t-1,l}, where l is the length of sentence t − 1. The ccDCLM model then creates additional paths for this information to impact each hidden representation in the current sentence t. Writing w_{t,n} for the nth word in the tth sentence and x_{t,n} for its word representation, we have
\[
\mathbf{h}_{t,n} = g_{\theta}\big(\mathbf{h}_{t,n-1}, f(\mathbf{c}_{t-1}, \mathbf{x}_{t,n})\big),
\]
where g_{\theta}(\cdot) is the activation function parameterized by θ and f(\cdot) is a function that combines the context vector with the input x_{t,n} for the hidden state. Here we simply concatenate the representations,
\[
f(\mathbf{c}_{t-1}, \mathbf{x}_{t,n}) = [\mathbf{x}_{t,n}; \mathbf{c}_{t-1}].
\]
The emission probability for yt,n is then computed from ht,n as in the standard
RNNLM. The underlying assumption of this model is that contextual information
should impact the generation of each word in the current sentence. The model,
therefore, introduces computational “short-circuits” for cross-sentence information,
as illustrated in Fig. 5.4.
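The short-circuit idea can be sketched as follows in PyTorch (assumed available): the final hidden state of the previous sentence is concatenated to every word embedding of the current sentence before the recurrent update. The use of a GRU and the layer sizes are illustrative choices, not the configuration of the original model.

```python
import torch
import torch.nn as nn

class ContextConcatLM(nn.Module):
    """Sentence-level RNN LM whose inputs are concatenated with a document-context vector."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Input at each step is [word embedding ; context vector c_{t-1}].
        self.rnn = nn.GRU(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sentence, context):
        # sentence: (batch, n) token ids; context: (batch, hidden_dim) = last state of sentence t-1
        x = self.embed(sentence)                                  # (batch, n, emb_dim)
        c = context.unsqueeze(1).expand(-1, x.size(1), -1)        # broadcast c_{t-1} to every step
        h, last = self.rnn(torch.cat([x, c], dim=-1))             # short-circuit via concatenation
        return self.out(h), last.squeeze(0)                       # logits and new context vector

model = ContextConcatLM(vocab_size=100)
context = torch.zeros(2, 128)                                     # empty context for the first sentence
logits, context = model(torch.randint(0, 100, (2, 7)), context)
print(logits.shape, context.shape)
```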
An alternative variant uses the context vector c_{t-1} directly in the output layer rather than in the recurrent transitions.

We now turn to the hierarchical document autoencoder [33], which builds document representations with LSTMs at two levels. At the word level, each sentence is encoded token by token:
\[
\mathbf{h}^{w}_{t}(\mathrm{enc}) = \mathrm{LSTM}^{word}_{encode}\big(\mathbf{w}_t, \mathbf{h}^{w}_{t-1}(\mathrm{enc})\big). \tag{5.18}
\]
The vector output at the ending time step is used to represent the entire sentence:
\[
\mathbf{s} = \mathbf{h}^{w}_{end_s}. \tag{5.19}
\]
The obtained sentence representations are then fed into a sentence-level LSTM:
\[
\mathbf{h}^{s}_{t}(\mathrm{enc}) = \mathrm{LSTM}^{sentence}_{encode}\big(\mathbf{s}_t, \mathbf{h}^{s}_{t-1}(\mathrm{enc})\big). \tag{5.20}
\]
The representation \mathbf{h}^{s}_{end_D} computed at the final time step is used to represent the entire document: \mathbf{d} = \mathbf{h}^{s}_{end_D}.
Thus one LSTM operates at the token level, leading to the acquisition of sentence-
level representations that are then used as inputs into the second LSTM that acquires
document-level representations, in a hierarchical structure.
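A minimal sketch of such a two-level encoder in PyTorch is given below; padding, batching across documents, and the decoder side are omitted, and the dimensions are placeholders rather than the original configuration.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word-level LSTM -> sentence vectors -> sentence-level LSTM -> document vector."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, document):
        # document: (num_sentences, sentence_length) token ids of a single document
        _, (h_w, _) = self.word_lstm(self.embed(document))   # h_w: (1, num_sentences, hidden)
        sent_vecs = h_w.squeeze(0).unsqueeze(0)              # (1, num_sentences, hidden)
        _, (h_s, _) = self.sent_lstm(sent_vecs)              # final state summarizes the document
        return sent_vecs.squeeze(0), h_s.squeeze(0).squeeze(0)

encoder = HierarchicalEncoder(vocab_size=100)
doc = torch.randint(0, 100, (5, 9))                          # 5 sentences of 9 tokens each
sentence_vectors, document_vector = encoder(doc)
print(sentence_vectors.shape, document_vector.shape)         # (5, 128) and (128,)
```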
As with encoding, the decoding algorithm operates on a hierarchical structure with
two layers of LSTMs. LSTM outputs at sentence level for time step t are obtained
by
\[
\mathbf{h}^{s}_{t}(\mathrm{dec}) = \mathrm{LSTM}^{sentence}_{decode}\big(\mathbf{s}_t, \mathbf{h}^{s}_{t-1}(\mathrm{dec})\big). \tag{5.21}
\]
At the initial time step, \mathbf{h}^{s}_{0}(\mathrm{dec}) = \mathbf{e}_D, the end-to-end output from the encoding procedure. \mathbf{h}^{s}_{t}(\mathrm{dec}) is then used as the original input into \mathrm{LSTM}^{word}_{decode} for subsequently predicting tokens within sentence t + 1. \mathrm{LSTM}^{word}_{decode} predicts tokens at each position sequentially; the embedding of each predicted token is combined with the earlier hidden vectors for the next time-step prediction, until the end_s token is predicted. The procedure can be summarized as follows:
\[
\mathbf{h}^{w}_{t}(\mathrm{dec}) = \mathrm{LSTM}^{word}_{decode}\big(\mathbf{w}_t, \mathbf{h}^{w}_{t-1}(\mathrm{dec})\big), \tag{5.22}
\]
An attention mechanism can further be added to the sentence-level decoder: at each decoding time step, a relevance score v_i is computed between the current decoding state and each encoded sentence representation, and the scores are normalized with a softmax:
\[
\alpha_i = \frac{\exp(v_i)}{\sum_{j} \exp(v_j)}. \tag{5.25}
\]
The attention vector is then created as a weighted average over all input sentences:
\[
\mathbf{m}_t = \sum_{i=1}^{N_D} \alpha_i \mathbf{h}^{s}_{i}(\mathrm{enc}), \tag{5.26}
\]
where N_D is the number of sentences in the document.
5.5 Applications

5.5.1 Information Retrieval
For a given query q and document d, traditional information retrieval models estimate their relevance through lexical matches. Neural information retrieval models pay more attention to capturing the query-document relevance from semantic matches. Both lexical and semantic matches are essential for neural information retrieval. Benefiting from the representation power of neural networks, neural models can capture more sophisticated matching features and have achieved the state of the art in information retrieval [17].
Current neural ranking models can be categorized into two groups: representation-
based and interaction-based [23]. The earlier works mainly focus on representation-
based models. They learn good representations and match them in the learned repre-
sentation space of queries and documents. Interaction-based methods, on the other
hand, model the query-document matches from the interactions of their terms.
The representation-based methods directly match the query and documents by learn-
ing two distributed representations, respectively, and then compute the matching
score based on the similarity between them. In recent years, several deep neural
models have been explored based on such Siamese architecture, which can be done
by feedforward layers, convolutional neural networks, or recurrent neural networks.
Reference [26] proposes the Deep Structured Semantic Model (DSSM), which first hashes words into letter-trigram-based representations and then uses a multilayer fully connected neural network to encode a query (or a document) as a vector. The relevance between the query and the document can then be simply calculated with cosine similarity. Reference [26] trains the model by minimizing the cross-entropy loss on
click-through data where each training sample consists of a query q, a positive doc-
ument d + , and a uniformly sampled negative document set D − :
\[
\mathcal{L}_{DSSM}(q, d^{+}, D^{-}) = -\log \frac{e^{r \cdot \cos(q, d^{+})}}{\sum_{d \in D} e^{r \cdot \cos(q, d)}}, \tag{5.27}
\]
where D = \{d^{+}\} \cup D^{-}.
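A compact sketch of this representation-based matching and the loss of Eq. 5.27 is shown below in PyTorch; the feedforward encoder over generic feature vectors stands in for the letter-trigram hashing and deeper stack of the original DSSM, and the dimensions and smoothing factor r are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared feedforward encoder mapping sparse text features to a dense vector."""
    def __init__(self, input_dim, hidden_dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, out_dim), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

def dssm_loss(query, pos_doc, neg_docs, encoder, r=10.0):
    """Softmax over cosine similarities of (query, d+) vs. sampled negative documents."""
    q = encoder(query)                                  # (dim,)
    docs = encoder(torch.stack([pos_doc] + neg_docs))   # (1 + |D-|, dim)
    cos = F.cosine_similarity(q.unsqueeze(0), docs)     # (1 + |D-|,)
    return -F.log_softmax(r * cos, dim=0)[0]            # positive document sits at index 0

encoder = SiameseEncoder(input_dim=300)
q = torch.rand(300)
loss = dssm_loss(q, torch.rand(300), [torch.rand(300) for _ in range(4)], encoder)
print(loss.item())
```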
Furthermore, CDSSM [54] and ARC-I [25] utilize convolutional neural network
(CNN), while LSTM-RNN [44] adopts recurrent neural network with Long Short-
Term Memory (LSTM) units to represent a sentence better. Reference [53] also comes
up with a more sophisticated similarity function by leveraging additional layers of
the neural network.
Interaction-based methods instead build word-level interactions between the query and the document (an interaction matrix), and then use neural networks to capture the matching patterns and aggregate the partial evidence of relevance. ARC-II [25] and MatchPyramid [45] utilize convolutional neural networks to capture complicated patterns from word-level interactions. The Deep Relevance Matching Model (DRMM) uses histogram pooling to summarize the word-level similarities for the ranking model [23]. There are also some works establishing position-dependent interactions for ranking models [27, 46].
The Kernel-based Neural Ranking Model (K-NRM) [66] and its convolutional version Conv-KNRM [17] achieve the state of the art in neural information retrieval. K-NRM first establishes a translation matrix M in which each element M_{ij} is the cosine similarity of the ith word in q and the jth word in d. Then K-NRM utilizes kernels to convert the translation matrix M into ranking features \phi(M):
\[
\phi(M) = \sum_{i=1}^{n} \log \mathbf{K}(M_i), \tag{5.28}
\]
where \mathbf{K}(M_i) = \{K_1(M_i), \ldots, K_K(M_i)\} applies K kernels to the ith row of the translation matrix. Each RBF kernel K_k calculates how the word pair similarities are distributed:
\[
K_k(M_i) = \sum_{j} \exp\left(-\frac{(M_{ij} - \mu_k)^2}{2\sigma_k^2}\right). \tag{5.30}
\]
For a given query q, D^{+,-} contains the pairwise preferences from the ground truth: d^{+} and d^{-} are two documents such that d^{+} is more relevant to q than d^{-}. Conv-KNRM extends K-NRM to model n-gram semantic matches with convolutional neural networks, which can leverage snippet information.
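The kernel pooling of Eqs. 5.28 and 5.30 can be sketched as follows, assuming the translation matrix M of word-pair (e.g., cosine) similarities has already been computed; the kernel means and width below are arbitrary illustrative values.

```python
import torch

def kernel_pooling(M, mus, sigma=0.1):
    """K-NRM-style features: phi(M) = sum_i log sum_j exp(-(M_ij - mu_k)^2 / (2 sigma^2))."""
    # M: (n, m) translation matrix of query-document word similarities
    diff = M.unsqueeze(-1) - mus                         # (n, m, K)
    K = torch.exp(-diff.pow(2) / (2 * sigma ** 2))
    K_i = K.sum(dim=1)                                   # (n, K): kernel scores per query word
    return torch.log(K_i.clamp(min=1e-10)).sum(dim=0)    # (K,) ranking features phi(M)

M = torch.rand(4, 20) * 2 - 1                            # toy similarities in [-1, 1]
mus = torch.linspace(-0.9, 1.0, steps=11)                # kernel means (illustrative)
phi = kernel_pooling(M, mus)
print(phi.shape)                                         # torch.Size([11])
```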
5.5.2 Question Answering
Question Answering (QA) is one of the most important document-level applications in NLP. Many efforts have been invested in QA, especially in machine reading comprehension and open-domain QA. In this section, we will introduce the advances in these two tasks, respectively.
As shown in Fig. 5.10, machine reading comprehension aims to determine the answer
a to the question q given a passage p. The task could be viewed as a supervised
learning problem: given a collection of training examples \{(p_i, q_i, a_i)\}_{i=1}^{n}, we want
to learn a mapping f (·) that takes the passage pi and corresponding question qi as
inputs and outputs âi , where evaluate(âi , ai ) is maximized. The evaluation metric is
typically correlated with the answer type, which will be discussed in the following.
Generally, the current machine reading comprehension task could be divided into
four categories depending on the answer types according to [10], i.e., cloze style,
multiple choices, span prediction, and free-form answer.
The cloze style task such as CNN/Daily Mail [24] consists of fill-in-the-blank
sentences where the question contains a placeholder to be filled in. The answer a is either chosen from a predefined candidate set A or from the vocabulary V. The
multiple-choice task such as RACE [30] and MCTest [50] aims to select the best
answer from a set of answer choices. It is typical to use accuracy to measure the
performance on these two tasks: the percentage of correctly answered questions in
the whole example set, since the question could be either correctly answered or not
from the given hypothesized answer set.
The span prediction task such as SQuAD [49] is perhaps the most widely adopted task of all, since it strikes a compromise between flexibility and simplicity. The task is to extract the most likely text span from the passage as the answer to the question, which is usually modeled as predicting the start position idx_start and end position idx_end of the answer span. To evaluate the predicted answer span â, we typically use
two evaluation metrics proposed by [49]. Exact match assigns full score 1.0 to the
predicted answer span â if it exactly equals the ground truth answer a, otherwise 0.0.
F1 score measures the degree of overlap between â and a by computing a harmonic
mean of the precision and recall.
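The two metrics can be computed as in the sketch below; it tokenizes by whitespace and skips the answer normalization (lowercasing, stripping punctuation and articles) performed by the official SQuAD evaluation script.

```python
from collections import Counter

def exact_match(prediction, ground_truth):
    """1.0 if the predicted span string equals the gold answer, else 0.0."""
    return float(prediction.strip() == ground_truth.strip())

def f1_score(prediction, ground_truth):
    """Token-level F1: harmonic mean of precision and recall over the overlapping tokens."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Tim Cook", "Tim Cook"), f1_score("the CEO Tim Cook", "Tim Cook"))
```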
The free-form answer task such as MS MARCO [43] does not restrict the answer form or length and is also referred to as generative question answering. It is practical to model the task as a sequence generation problem in which discrete token-level predictions are made. Currently, no consensus on the ideal evaluation metric has been reached; it is common to adopt standard metrics from machine translation and summarization, including ROUGE [34] and BLEU [57].
As a critical component of question answering systems, neural machine reading comprehension models have greatly boosted the task of question answering in the last decade.
The first attempt [24] to apply neural networks on machine reading comprehension
constructs bidirectional LSTM reader models along with attention mechanisms. The
work introduces two reader models, i.e., the attentive reader and the impatient reader,
as shown in Fig. 5.11. After encoding the passage and the query into hidden states
using LSTMs, the attentive reader computes a scalar distribution s(t) over the passage
tokens and uses it to compute the weighted sum of the passage hidden states r . The
impatient reader extends this idea further by recurrently updating the weighted sum
of passage hidden states after it has seen each query token.
The attention mechanisms used in reading comprehension could be viewed as
a variant of Memory Networks [64]. Memory Networks use long-term memory
units to store information for inference dynamically. Typically, given an input x,
the model first converts it into an internal feature representation F(x). Then, the
model can update the designated memory units m i given the new input: m i =
g(m i , F(x), m), or generate output features o given the new input and the mem-
ory states: o = f (F(x), m). Finally, the model converts the output into the response
with the desired format: r = R(o). The key takeaway of Memory Networks is the retention and updating of internal memories that capture global information. We will see how this idea is further extended in more sophisticated models.
There is no doubt that the application of attention to machine reading comprehension has greatly promoted research in this field. Following [24], the work of [11] modifies the way attention is computed and simplifies the prediction layer of the attentive reader. Instead of using tanh(·) to compute the relevance between the passage representations \{\tilde{\mathbf{p}}_i\}_{i=1}^{n} and the query hidden state q (see Eq. 5.33), Chen et al. use a bilinear term to directly capture the passage-query alignment (see Eq. 5.34).
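The bilinear attention can be written in a few lines of PyTorch, as sketched below; the dimensions are placeholders, and the output is the attention distribution together with the attention-weighted passage summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    """Scores each passage position i as p_i^T W q and averages the passage with those weights."""
    def __init__(self, p_dim, q_dim):
        super().__init__()
        self.W = nn.Linear(q_dim, p_dim, bias=False)

    def forward(self, passage, question):
        # passage: (n, p_dim) contextual passage vectors; question: (q_dim,) question vector
        scores = passage @ self.W(question)          # (n,) bilinear relevance p_i^T W q
        alpha = F.softmax(scores, dim=0)             # attention distribution over passage tokens
        return alpha, alpha @ passage                # weights and the weighted-sum representation

attn = BilinearAttention(p_dim=128, q_dim=96)
alpha, summary = attn(torch.rand(30, 128), torch.rand(96))
print(alpha.shape, summary.shape)                    # (30,) and (128,)
```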
Most machine reading comprehension models follow the same paradigm to locate
the start and endpoint of the answer span. As shown in Fig. 5.12, while encoding the
passage, the model retains the length of the sequence and encodes the question into
a fixed-length hidden representation q. The question’s hidden vector is then used
as a pointer to scan over the passage representations \{\mathbf{p}_i\}_{i=1}^{n} and compute scores
on every position in the passage. While maintaining this similar architecture, most
machine reading comprehension models vary in the interaction methods between the
passage and the question. In the following, we will introduce several classic reading
comprehension architectures that follow this paradigm.
First, we introduce BiDAF, which is short for Bi-Directional Attention Flow [52].
The BiDAF network consists of the token embedding layer, the contextual embedding
layer, the bi-directional attention flow layer, the LSTM modeling layer, and the
softmax output layer, as shown in Fig. 5.13.
The token embedding layer consists of two levels. First, the character embedding
layer encodes each word in character level by adopting a 1D convolutional neural
network (CNN). Specifically, for each word, characters are embedded into fixed-
length vectors, which are considered as 1D input for CNNs. The outputs are then
max-pooled along the embedding dimension to obtain a single fixed-length vector.
Second, the word embedding layer uses pretrained word vectors, i.e., GloVe [47], to
map each word into a high-dimensional vector directly.
Then the concatenation of the two vectors is fed into a two-layer Highway Network [56]. In each highway layer, two affine transformations H_1(\cdot) and H_2(\cdot) of the input are computed; one of them, passed through a sigmoid, acts as a gate that interpolates between the other (followed by a nonlinearity) and the input itself.
After feeding the context and the query to the token embedding layer, we obtain X ∈ R^{d×T} for the context and Q ∈ R^{d×J} for the query, respectively. Afterward, the contextual embedding layer, which is a bidirectional LSTM, models the temporal interactions between words for both the context and the query.
Next comes the attention flow layer. In this layer, attention is computed in both directions, i.e., the context-to-query (C2Q) attention and the query-to-context (Q2C) attention. For both kinds of attention, we first compute a similarity matrix S ∈ R^{T×J} from the contextual embeddings of the context H and the query U obtained from the previous layer (Eq. 5.37), where α(·) computes the scalar similarity of two given vectors and m is a trainable weight vector.
Afterward, the LSTM modeling layer takes G as input and encodes it using a
two-layer bidirectional LSTM. The output M ∈ R2d×T is combined with G to yield
the final start and end probability distributions over the passage.
The model is trained by minimizing the negative log likelihood of the true start and end positions:
\[
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\big(P^{1}_{idx^{i}_{start}}\big) + \log\big(P^{2}_{idx^{i}_{end}}\big)\right]. \tag{5.43}
\]
The Gated-Attention Reader [19] models the passage-question interaction with an element-wise gating operation. For each token representation \mathbf{d}_i of the passage, an attention over the question representation Q produces a token-specific query vector \tilde{\mathbf{q}}_i, which gates \mathbf{d}_i:
\[
\alpha_i = \mathrm{Softmax}(Q^{\top} \mathbf{d}_i), \tag{5.44}
\]
\[
\tilde{\mathbf{q}}_i = Q \alpha_i, \tag{5.45}
\]
\[
\mathbf{x}_i = \mathbf{d}_i \odot \tilde{\mathbf{q}}_i. \tag{5.46}
\]
This gated attention mechanism allows the query to directly interact with the token
embeddings of the passage at the semantic level. And such layer-wise interaction
enables the model to learn conditional token representation given the question at
different representation levels.
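The gated attention of Eqs. 5.44–5.46 amounts to attending over the question words for each passage token and multiplying the resulting query vector element-wise into the token representation; a compact sketch (with placeholder dimensions) follows.

```python
import torch
import torch.nn.functional as F

def gated_attention(D, Q):
    """Element-wise gating of passage token embeddings by query-aware vectors."""
    # D: (n, h) passage token embeddings; Q: (m, h) question word embeddings
    alpha = F.softmax(D @ Q.t(), dim=1)   # (n, m): attention of each token over question words
    Q_tilde = alpha @ Q                   # (n, h): token-specific query vectors
    return D * Q_tilde                    # (n, h): gated passage representation x_i = d_i * q~_i

X = gated_attention(torch.rand(30, 128), torch.rand(10, 128))
print(X.shape)                            # torch.Size([30, 128])
```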
The Attention-over-Attention Reader [16] takes another path to model the inter-
action. The attention-over-attention mechanism involves calculating the attention
between the passage attention α(t) and the averaged question attention β after obtain-
ing the similarity matrix M ∈ Rn×m (Eq. 5.47). This operation is considered to learn
the contributions of individual question words explicitly.
\[
\alpha(t) = \mathrm{Softmax}(M_{:,t}), \qquad \beta = \frac{1}{N}\sum_{t=1}^{N} \mathrm{Softmax}(M_{t,:}). \tag{5.47}
\]
N t=1
Open-domain QA (OpenQA) was first proposed by [21]. The task aims to answer open-domain questions using external resources such as collections of documents [58], web pages [14, 29], structured knowledge graphs [3, 7], or automatically extracted relational triples [20].
Recently, with the development of machine reading comprehension techniques
[11, 16, 19, 55, 63], researchers attempt to answer open-domain questions via per-
forming reading comprehension on plain texts. Reference [12] proposes to employ
neural-based models to answer open-domain questions. As illustrated in Fig. 5.14,
a neural-based OpenQA system usually retrieves texts relevant to the question from a
large-scale corpus and then extracts answers from these texts using reading compre-
hension models.
Fig. 5.14 The pipeline of a neural OpenQA system: a document retriever finds relevant documents and a document reader extracts the answer (e.g., "Tim Cook")
Such a pipeline is exemplified by the DrQA system [12], which consists of two components: (1) the document retriever module for finding relevant articles and (2) the document reader model for extracting answers from the given contexts.
The document retriever is used as a first quick skim to narrow the searching space
and focus on documents that are likely to be relevant. The retriever builds TF-IDF
weighted bag-of-words vectors for the documents and the questions, and computes
similarity scores for ranking. To further utilize local word order information, the
retriever uses hashed bigram counts, preserving both speed and memory efficiency.
The document reader model takes in the top 5 Wikipedia articles yielded by the
document retriever and extracts the final answer to the question. For each article, the
document reader predicts an answer span with a confidence score. The final prediction
is made by maximizing the unnormalized exponential of prediction scores across the
documents.
Given each document d, the document reader first builds feature representation
d̃i for each word in the document. The feature representation d̃ is made up by the
following components.
1. Word embeddings: The word embeddings f emb (d) are obtained from large-scale
GloVe embeddings pretrained on Wikipedia.
2. Manual features: The manual features f token (d) combined part-of-speech (POS)
and named entity recognition tags and normalized Term Frequencies (TF).
3. Exact match: This feature indicates whether di can be exactly matched to one
question word in q.
4. Aligned question embeddings: This feature aims to encode a soft alignment
between words in the document and the question in the word embedding space.
\[
f_{align}(d_i) = \sum_{j} \alpha_{i,j} E(q_j), \tag{5.48}
\]
where \alpha_{i,j} is the soft-alignment weight between d_i and the question word q_j. The final feature representation is the concatenation
\[
\tilde{\mathbf{d}}_i = \big(f_{emb}(d_i), f_{token}(d_i), f_{exact\_match}(d_i), f_{align}(d_i)\big). \tag{5.50}
\]
Then the feature representation of the document is fed into a multilayer bidirec-
tional LSTM (BiLSTM) to encode the contextual representation.
The question is encoded with another recurrent network, and the hidden representations \{\mathbf{q}_j\} of the question words are aggregated into a single vector using learned attention weights:
\[
b_j = \frac{\exp(\mathbf{u} \cdot \mathbf{q}_j)}{\sum_{j'} \exp(\mathbf{u} \cdot \mathbf{q}_{j'})}, \tag{5.53}
\]
\[
\mathbf{q} = \sum_{j} b_j \mathbf{q}_j, \tag{5.54}
\]
where \mathbf{u} is a trainable weight vector.
In the answer prediction phase, the start and end probability distributions are calculated following the paradigm of the reading comprehension models described above:
\[
P^{start}(i) = \frac{\exp(\mathbf{d}_i \mathbf{W}_{start} \mathbf{q})}{\sum_{i'} \exp(\mathbf{d}_{i'} \mathbf{W}_{start} \mathbf{q})}, \tag{5.55}
\]
\[
P^{end}(i) = \frac{\exp(\mathbf{d}_i \mathbf{W}_{end} \mathbf{q})}{\sum_{i'} \exp(\mathbf{d}_{i'} \mathbf{W}_{end} \mathbf{q})}. \tag{5.56}
\]
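The start/end scoring of Eqs. 5.55 and 5.56 and the selection of the best answer span can be sketched as below; the bilinear weights and dimensions are placeholders, and the constraint that the end does not precede the start (plus a maximum span length) is enforced when taking the argmax.

```python
import torch
import torch.nn.functional as F

def best_span(D, q, W_start, W_end, max_len=15):
    """Pick the span (i, j) maximizing P_start(i) * P_end(j) with i <= j <= i + max_len."""
    # D: (n, h) passage token encodings; q: (h,) question encoding; W_*: (h, h) bilinear weights
    p_start = F.softmax(D @ W_start @ q, dim=0)          # (n,)
    p_end = F.softmax(D @ W_end @ q, dim=0)              # (n,)
    scores = p_start.unsqueeze(1) * p_end.unsqueeze(0)   # (n, n) joint span scores
    mask = torch.triu(torch.ones_like(scores)) - torch.triu(torch.ones_like(scores), diagonal=max_len + 1)
    scores = scores * mask                               # keep only valid spans
    idx = torch.argmax(scores)
    return divmod(idx.item(), scores.size(1))            # (start, end) token indices

D, q = torch.rand(40, 128), torch.rand(128)
W_start, W_end = torch.rand(128, 128), torch.rand(128, 128)
print(best_span(D, q, W_start, W_end))
```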
Despite its success, the DrQA system is prone to noise in retrieved texts which may
hurt the performance of the system. Hence, [15] and [61] attempt to solve the noise problem in DrQA by separating question answering into paragraph selection and answer extraction; they both select only the most relevant paragraph among all retrieved paragraphs to extract answers, and thus lose a large amount of rich information
contained in those neglected paragraphs. Hence, [62] proposes strength-based and
coverage-based re-ranking approaches, which can aggregate the results extracted
from each paragraph by the existing DS-QA system to determine the answer better.
However, the method relies on the pre-extracted answers of existing DS-QA models
and still suffers from the noise issue in distant supervision data because it considers all
retrieved paragraphs indiscriminately. To address this issue, [35] proposes a coarse-
to-fine denoising OpenQA model, which employs a paragraph selector to filter out
paragraphs and a paragraph reader to extract the correct answer from those denoised
paragraphs.
5.6 Summary
References
1. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander
Smola. Scalable inference in latent variable models. In Proceedings of WSDM, 2012.
2. Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and infer-
ence for topic models. In Proceedings of UAI, 2009.
3. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase
from question-answer pairs. In Proceedings of EMNLP, 2013.
4. David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested chinese restaurant process
and bayesian nonparametric inference of topic hierarchies. The Journal of the ACM, 57(2):7,
2010.
5. David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of ICML, 2006.
6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003.
7. Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple ques-
tion answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
8. Jordan L Boyd-Graber and David M Blei. Syntactic topic models. In Proceedings of NeurIPS,
2009.
9. Jonathan Chang and David M Blei. Hierarchical relational models for document networks. The
Annals of Applied Statistics, pages 124–150, 2010.
10. Danqi Chen. Neural Reading Comprehension and Beyond. PhD thesis, Stanford University,
2018.
11. Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the
cnn/daily mail reading comprehension task. In Proceedings of ACL, 2016.
12. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer
open-domain questions. In Proceedings of the ACL, 2017.
13. Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. WarpLDA: a cache efficient O(1) algorithm for latent dirichlet allocation. Proceedings of VLDB, 2016.
14. Tongfei Chen and Benjamin Van Durme. Discriminative information retrieval for question
answering sentence selection. In Proceedings of EACL, 2017.
15. Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and
Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of
ACL, 2017.
16. Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-
attention neural networks for reading comprehension. In Proceedings of ACL, 2017.
17. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks
for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018.
18. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
19. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Gated-
attention readers for text comprehension. In Proceedings of ACL, 2017.
20. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated
and extracted knowledge bases. In Proceedings of SIGKDD, 2014.
21. Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic
question-answerer. In Proceedings of IRE-AIEE-ACM, 1961.
22. Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics
and syntax. In Proceedings of NeurIPS, 2004.
23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM, 2016.
24. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of
NeurIPS, 2015.
25. Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network archi-
tectures for matching natural language sentences. In Proceedings of NeurIPS, 2014.
26. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning
deep structured semantic models for web search using clickthrough data. In Proceedings of
CIKM, 2013.
27. Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. Pacrr: A position-aware neural
ir model for relevance matching. In Proceedings of EMNLP, 2017.
28. Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context
language models. arXiv preprint arXiv:1511.03962, 2015.
29. Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. TOIS,
pages 242–262, 2001.
30. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale
reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
31. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In
Proceedings of ICML, 2014.
32. Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, and Jungang
Xu. Nprf: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In
Proceedings of EMNLP, 2018.
33. Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs
and documents. In Proceedings of ACL, 2015.
34. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization
Branches Out, 2004.
35. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-
domain question answering. In Proceedings of ACL, 2018.
36. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking:
Understanding the role of knowledge graph semantics in neural information retrieval. In Pro-
ceedings of ACL, 2018.
37. Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of NeurIPS, 2008.
38. T Mikolov and J Dean. Distributed representations of words and phrases and their composi-
tionality. Proceedings of NeurIPS, 2013.
39. David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with
dirichlet-multinomial regression. In Proceedings of UAI, 2008.
40. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed
representations of text for web search. In Proceedings of WWW, 2017.
41. David Newman, Arthur U Asuncion, Padhraic Smyth, and Max Welling. Distributed inference
for latent dirichlet allocation. In Proceedings of NeurIPS, 2007.
42. David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic mod-
els. In Proceedings of SIGKDD, 2006.
43. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv
preprint arXiv:1611.09268, 2016.
44. Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying
Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks:
Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech
and Language Processing, 24(4):694–707, 2016.
45. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text match-
ing as image recognition. In Proceedings of AAAI, 2016.
46. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. Deeprank: A
new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM,
2017.
47. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word
representation. In Proceedings of EMNLP, 2014.
48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-
HLT, pages 2227–2237, 2018.
49. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ ques-
tions for machine comprehension of text. In Proceedings of EMNLP, 2016.
50. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset
for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013.
51. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic
model for authors and documents. In Proceedings of UAI, 2004.
52. Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional atten-
tion flow for machine comprehension. In Proceedings of ICLR, 2017.
53. Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional
deep neural networks. In Proceedings of SIGIR, 2015.
54. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic
model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM,
2014.
55. Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop
reading in machine comprehension. In Proceedings of SIGKDD, 2017.
56. Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv
preprint arXiv:1505.00387, 2015.
57. A Cuneyd Tantug, Kemal Oflazer, and Ilknur Durgar El-Kahlout. Bleu+: a tool for fine-grained
bleu computation. 2008.
58. Ellen M Voorhees et al. The trec-8 question answering track report. In Proceedings of TREC,
1999.
59. Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of ICML, 2006.
60. Chong Wang, Bo Thiesson, Chris Meek, and David Blei. Markov topic models. In Proceedings
of AISTATS, 2009.
61. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang,
Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain
question answering. In Proceedings of AAAI, 2018.
62. Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang,
Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-
ranking in open-domain question answering. In Proceedings of ICLR, 2018.
63. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching
networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
64. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv
preprint arXiv:1410.3916, 2014.
65. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document
ranking. In Proceedings of SIGIR, 2017.
66. Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural
ad-hoc ranking with kernel pooling. In Proceedings of SIGIR, 2017.
67. Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Selective weak supervision
for neural information retrieval. arXiv preprint arXiv:2001.10382, 2020.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.