
Chapter 5

Document Representation

Abstract A document is usually the highest linguistic unit of natural language. Document representation aims to encode the semantic information of the whole
document into a real-valued representation vector, which could be further utilized
in downstream tasks. Recently, document representation has become an essential
task in natural language processing and has been widely used in many document-
level real-world applications such as information retrieval and question answering.
In this chapter, we first introduce the one-hot representation for documents. Next,
we extensively introduce topic models that learn the topic distribution of words and
documents. Further, we give an introduction to distributed document representation,
including paragraph vector and neural document representations. Finally, we intro-
duce several typical real-world applications of document representation, including
information retrieval and question answering.

5.1 Introduction

Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number
of documents available online. While more and more textual information is avail-
able electronically, effective retrieval and mining are getting more and more difficult
without the efficient organization, summarization, and indexing of document content.
Therefore, document representation is playing an important role in many real-world
applications, e.g., document retrieval, web search, and spam filtering. Document rep-
resentation aims to represent document input into a fixed-length vector, which could
describe the contents of the document, to reduce the complexity of the documents
and make them easier to handle. Traditional document representation models such
as one-hot document representation have achieved promising results in many docu-
ment classification and clustering tasks due to their simplicity, efficiency, and often
surprising accuracy.
However, the one-hot document representation model has many disadvantages. First, it loses the word order, and thus different documents can have the same representation as long as the same words are used. Second, it usually suffers from data sparsity and high dimensionality. The one-hot document representation model also has very little sense of the semantics of words or, more formally, the distances between words. Hence, an alternative approach represents text documents using multi-word terms as vector components, where the terms are noun phrases extracted using a combination of linguistic and statistical criteria. This representation is motivated by the notion, shared with topic models, that terms should contain more semantic information than individual words. Another advantage of using terms to represent a document is the lower dimensionality compared with the traditional one-hot document representation.
Nevertheless, applying these representations to generation tasks remains difficult. To understand
how discourse units are connected, one has to understand the communicative function
of each unit, and the role it plays within the context that encapsulates it, recursively
all the way up for the entire text. Identifying increasingly sophisticated human-
developed features may be insufficient for capturing these patterns, but developing
representation-based alternatives has also been difficult. Although document repre-
sentation can capture aspects of coherent sentence structure, it is not clear how it
could help in generating more broadly cohesive text.
Recently, neural network models have shown compelling results in generating meaningful and grammatical documents in sequence generation tasks like machine translation or parsing. This is partially attributed to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form the meanings they wish to express. Based on neural net-
work models, many research works have developed a variety of ways to incorporate
document-level contextual information. These models are all hybrid architectures in
that they are recurrent at the sentence level, but use a different structure to summarize
the context outside the sentence. Furthermore, some models explore multilevel recur-
rent architectures for combining local and global information in language modeling.
In this chapter, we first introduce the one-hot representation for documents. Next,
we extensively introduce topic models that aim to learn latent topic distributions of
words and documents. Further, we give an introduction on distributed document rep-
resentation including paragraph vector and neural document representations. Finally,
we introduce several typical real-world applications of document representations,
including information retrieval and question answering.

5.2 One-Hot Document Representation

The majority of machine learning algorithms take a fixed-length vector as input, so documents need to be represented as vectors. The bag-of-words model is the most common and simple representation method for documents. Similar to one-hot sentence representation, for a document d = {w_1, w_2, ..., w_l}, a bag-of-words representation d can be used to represent this document. Specifically, for
a vocabulary V = [w_1, w_2, ..., w_{|V|}], the one-hot representation of word w is w = [0, 0, ..., 1, ..., 0]. Based on the one-hot word representation and the vocabulary V, it can be extended to represent a document as

d = \sum_{i=1}^{l} w_i, (5.1)

where l is the length of the document d. Similar to one-hot sentence representation, the TF-IDF method can also be applied to enhance the ability of the bag-of-words representation to reflect how important a word is to a document in a corpus. In practice, the bag-of-words representation is mainly used as a tool for feature generation, and the most common type of feature calculated from this method is the frequency with which each word appears in the document.
times can reach excellent performance in many real-world applications. However, the
bag-of-words representation still ignores entirely the word order information, which
means different documents can have the same representation as long as the same
words are used. Furthermore, bag-of-words representation has little sense about the
semantics of the words or, more formally, the distances between words, which means
this method cannot utilize rich information hidden in the word representations.
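To make this concrete, the following minimal Python sketch (our own illustration, using a toy three-document corpus and whitespace tokenization) builds both the raw bag-of-words vector of Eq. 5.1 and a TF-IDF reweighted variant.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats can be friends",
]

# Vocabulary over the whole corpus; each word gets one dimension.
vocab = sorted({w for doc in corpus for w in doc.split()})
word2id = {w: i for i, w in enumerate(vocab)}

def bag_of_words(doc):
    """Sum of one-hot word vectors, i.e., raw term frequencies (Eq. 5.1)."""
    vec = [0.0] * len(vocab)
    for w in doc.split():
        vec[word2id[w]] += 1.0
    return vec

# Document frequency and inverse document frequency for every vocabulary word.
df = Counter(w for doc in corpus for w in set(doc.split()))
idf = {w: math.log(len(corpus) / df[w]) for w in vocab}

def tf_idf(doc):
    """Reweight term frequencies by IDF so words common across the corpus are down-weighted."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for w, c in counts.items():
        vec[word2id[w]] = (c / total) * idf[w]
    return vec

print(bag_of_words(corpus[0]))
print(tf_idf(corpus[0]))
```

Libraries such as scikit-learn provide optimized sparse implementations of both representations, which is what one would use beyond toy examples.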

5.3 Topic Model

As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, audio, videos, and social
networks, it becomes more difficult to find and discover what we are looking for. We
need new computational tools to help organize, search, and understand these vast
amounts of information.
Right now, we work with online information using two main tools—search and
links. We type keywords into a search engine and find a set of documents related to
them. We look at the documents in that set, possibly navigating to other linked doc-
uments. This is a powerful way of interacting with our online archive, but something
is missing.
Imagine searching and exploring documents based on the themes that run through
them. We might “zoom in” and “zoom out” to find specific or broader themes; we
might look at how those themes changed through time or how they are connected.
Rather than finding documents through keyword search alone, we might first find
the theme that we are interested in, and then examine the documents related to that
theme.
For example, consider using themes to explore the complete history of the New
York Times. At a broad level, some of the themes might correspond to the sections
of the newspaper, such as foreign policy, national affairs, and sports. We could zoom
in on a theme of interest, such as foreign policy, to reveal various aspects of it, such
as Chinese foreign policy, the conflict in the Middle East, and the United States’
relationship with Russia. We could then navigate through time to reveal how these
specific themes have changed, tracking, for example, the changes in the conflict in
the Middle East over the last 50 years. And, in all of this exploration, we would be
pointed to the original articles relevant to the themes. The thematic structure would
be a new kind of window through which to explore and digest the collection.
But we do not interact with electronic archives in this way. While more and more
texts are available online, we do not have the human power to read and study them to
provide the kind of browsing experience described above. To this end, machine learn-
ing researchers have developed probabilistic topic modeling, a suite of algorithms
that aim to discover and annotate vast archives of documents with thematic informa-
tion. Topic modeling algorithms are statistical methods that analyze the words of the
original texts to explore the themes that run through them, how those themes are con-
nected, and how they change over time. Topic modeling algorithms do not require
any prior annotations or labeling of the documents. The topics emerge from the
analysis of the original texts. Topic modeling enables us to organize and summarize
electronic archives at a scale that would be impossible by human annotation.

5.3.1 Latent Dirichlet Allocation

A variety of probabilistic topic models have been used to analyze the content of
documents and the meaning of words. Hofmann first introduced the probabilistic
topic approach to document modeling in his Probabilistic Latent Semantic Indexing
method (pLSI). The pLSI model does not make any assumptions about how the mixture weights are generated, making it difficult to test the generalization ability of the model to new documents. Latent Dirichlet Allocation (LDA) was therefore extended from this model by introducing a Dirichlet prior over the mixture weights. LDA is regarded as a simple but efficient topic model. We first describe the basic ideas of LDA [6].
The intuition behind LDA is that documents exhibit multiple topics. LDA is a
statistical model of document collections that tries to capture this intuition. It is most
easily described by its generative process, the imaginary random process by which
the model assumes the documents arose.
We formally define a topic to be a distribution over a fixed vocabulary. We assume
that these topics are specified before any data has been generated. Now for each
document in the collection, we generate the words in a two-stage process.

1. Randomly choose a distribution over topics.


2. For each word in the document,
• Randomly choose a topic from the distribution over topics in step #1.
• Randomly choose a word from the corresponding distribution over the vocab-
ulary.

This statistical model reflects the intuition that documents exhibit multiple topics.
Each document exhibits the topics with different proportions (step #1); each word in
each document is drawn from one of the topics (step #2b), where the selected topic
is chosen from the per-document distribution over topics (step #2a).
We emphasize that the algorithms have no information about these subjects and the
articles are not labeled with topics or keywords. The interpretable topic distributions
arise by computing the hidden structure that likely generated the observed collection
of documents.
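The two-stage generative process can be simulated directly. The sketch below is our own illustration; the topic and vocabulary sizes, the hyperparameter values, and the use of numpy's Dirichlet and categorical samplers are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, doc_len = 4, 1000, 50     # number of topics, vocabulary size, document length
alpha, eta = 0.1, 0.01          # Dirichlet hyperparameters (illustrative values)

# Topics: K distributions over the vocabulary, assumed specified before any data is generated.
beta = rng.dirichlet(np.full(V, eta), size=K)     # shape (K, V)

# Step 1: randomly choose this document's distribution over topics.
theta = rng.dirichlet(np.full(K, alpha))          # shape (K,)

words, assignments = [], []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)        # step 2a: choose a topic from the topic distribution
    w = rng.choice(V, p=beta[z])      # step 2b: choose a word from that topic's word distribution
    assignments.append(z)
    words.append(w)

print(assignments[:10])
print(words[:10])
```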

5.3.1.1 LDA and Probabilistic Models

LDA and other topic models are part of the broader field of probabilistic modeling.
In generative probabilistic modeling, we treat our data as arising from a generative
process that includes hidden variables. This generative process defines a joint prob-
ability distribution over both the observed and hidden random variables. Given the
observed variables, we perform data analysis by using that joint distribution to com-
pute the conditional distribution of the hidden variables. This conditional distribution
is also called the posterior distribution.
LDA falls precisely into this framework. The observed variables are the words of
the documents, the hidden variables are the topic structure, and the generative process
is as described above. The computational problem of inferring the hidden topic
structure from the documents is the problem of computing the posterior distribution,
the conditional distribution of the hidden variables given the documents.
We can describe LDA more formally with the following notation. The topics are
β1:K , where each βk is a distribution over the vocabulary. The topic proportions for the
dth document are θd , where θdk is the topic proportion for topic k in document d. The
topic assignments for the dth document are z d , where z d,n is the topic assignment
for the nth word in document d. Finally, the observed words for document d are
wd , where wd,n is the nth word in document d, which is an element from the fixed
vocabulary.
With this notation, the generative process for LDA corresponds to the following
joint distribution of the hidden and observed variables:


P(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} P(β_i) \prod_{d=1}^{D} P(θ_d) \prod_{n=1}^{N} P(z_{d,n} | θ_d) P(w_{d,n} | β_{1:K}, z_{d,n}). (5.2)
Notice that this distribution specifies a number of dependencies. For example,
the topic assignment z d,n depends on the per-document topic proportions θd . As
another example, the observed word wd,n depends on the topic assignment z d,n and
all of the topics β1:K .
These dependencies define LDA. They are encoded in the statistical assumptions
behind the generative process, in the particular mathematical form of the joint distri-
bution, and in a third way, in the probabilistic graphical model for LDA. Probabilistic
graphical models provide a graphical language for describing families of probability distributions.

Fig. 5.1 The architecture of the graphical model for Latent Dirichlet Allocation

The graphical model for LDA is in Fig. 5.1. Each node is a random vari-
able and is labeled according to its role in the generative process. The hidden nodes (the topic proportions, assignments, and topics) are unshaded. The observed nodes (the words of the documents) are shaded. We use rectangles as plate notation to denote
replication. The N plate denotes the collection of words within documents; the D
plate denotes the collection of documents within the collection. These three repre-
sentations are equivalent ways of describing the probabilistic assumptions behind
LDA.

5.3.1.2 Posterior Computation for LDA

We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this
is called the posterior.) Using our notation, the posterior is

P(β_{1:K}, θ_{1:D}, z_{1:D} | w_{1:D}) = \frac{P(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D})}{P(w_{1:D})}. (5.3)

The numerator is the joint distribution of all the random variables, which can
be easily computed for any setting of the hidden variables. The denominator is
the marginal probability of the observations, which is the probability of seeing the
observed corpus under any topic model. In theory, it can be computed by summing
the joint distribution over every possible instantiation of the hidden topic structure.
Topic modeling algorithms form an approximation of the above equation by form-
ing an alternative distribution over the latent topic structure that is adapted to be close
to the true posterior. Topic modeling algorithms generally fall into two categories:
sampling-based algorithms and variational algorithms.
Sampling-based algorithms attempt to collect samples from the posterior by
approximating it with an empirical distribution. The most commonly used sampling
algorithm for topic modeling is Gibbs sampling, where we construct a Markov chain (a sequence of random variables, each dependent on the previous) whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables
for a particular corpus, and the algorithm is to run the chain for a long time, collect
samples from the limiting distribution, and then approximate the distribution with
the collected samples.

Variational methods are a deterministic alternative to sampling-based algorithms. Rather than approximating the posterior with samples, variational methods posit a
parameterized family of distributions over the hidden structure and then find the
member of that family that is closest to the posterior. Thus, the inference problem
is transformed into an optimization problem. Variational methods open the door for
innovations in optimization to have a practical impact on probabilistic modeling.
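As a concrete illustration of the sampling-based family, the sketch below runs collapsed Gibbs sampling for LDA on a toy corpus. The count matrices and the sampling proportionality follow the standard collapsed Gibbs derivation; the corpus, hyperparameter values, and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]   # toy corpus: lists of word ids
K, V = 2, 5
alpha, beta = 0.5, 0.1

# Random initial topic assignment for every token.
z = [[rng.integers(K) for _ in doc] for doc in docs]

# Count matrices: document-topic, topic-word, and per-topic totals.
Cd = np.zeros((len(docs), K))
Cw = np.zeros((K, V))
Ck = np.zeros(K)
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        Cd[d, k] += 1; Cw[k, w] += 1; Ck[k] += 1

def gibbs_sweep():
    """One pass of collapsed Gibbs sampling: resample the topic of every token."""
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            # Remove the token's current assignment from the counts.
            Cd[d, k] -= 1; Cw[k, w] -= 1; Ck[k] -= 1
            # Conditional distribution over topics for this token.
            p = (Cd[d] + alpha) * (Cw[:, w] + beta) / (Ck + V * beta)
            k = rng.choice(K, p=p / p.sum())
            # Put the new assignment back; counts are updated instantly.
            z[d][n] = k
            Cd[d, k] += 1; Cw[k, w] += 1; Ck[k] += 1

for _ in range(100):
    gibbs_sweep()
print(Cw / Cw.sum(axis=1, keepdims=True))   # rough estimates of the topic-word distributions
```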

5.3.2 Extensions

The simple LDA model provides a powerful tool for discovering and exploiting
the hidden thematic structure in large archives of text. However, one of the main
advantages of formulating LDA as a probabilistic model is that it can easily be
used as a module in more complicated models for more complex goals. Since its
introduction, LDA has been extended and adapted in many ways.

5.3.2.1 Relaxing the Assumptions of LDA

LDA is defined by the statistical assumptions it makes about the corpus. One active
area of topic modeling research is how to relax and extend these assumptions to
uncover a more sophisticated structure in the texts.
One assumption that LDA makes is the bag-of-words assumption that the order
of the words in the document does not matter. While this assumption is unrealistic, it
is reasonable if our only goal is to uncover the coarse semantic structure of the texts.
For more sophisticated goals, such as language generation, it is patently not appropriate. There have been many extensions to LDA that treat words as non-exchangeable. For example, [59] develops a topic model that relaxes the bag-of-words assumption by assuming that the topics generate words conditional on the previous word; [22]
develops a topic model that switches between LDA and a standard HMM. These mod-
els expand the parameter space significantly but show improved language modeling
performance.
Another assumption is that the order of documents does not matter. Again, this
can be seen by noticing that Eq. 5.3 remains invariant to permutations of the ordering
of documents in the collection. This assumption may be unrealistic when analyzing
long-running collections that span years or centuries. In such collections, we may
want to assume that the topics change over time. One approach to this problem is the
dynamic topic model [5], a model that respects the ordering of the documents and
gives a more productive posterior topical structure than LDA.
The third assumption about LDA is that the number of topics is assumed known
and fixed. The Bayesian nonparametric topic model provides an elegant solution: The
collection determines the number of topics during posterior inference, and new doc-
uments can exhibit previously unseen topics. Bayesian nonparametric topic models
have been extended to hierarchies of topics, which find a tree of topics, moving from
more general to more concrete, whose particular structure is inferred from the data
[4].

5.3.2.2 Incorporating Meta-Data into LDA

In many text analysis settings, the documents contain additional information such
as author, title, geographic location, links, and others that we might want to account
for when fitting a topic model. There has been a flurry of research on adapting topic
models to include meta-data.
The author-topic model [51] is an early success story for this kind of research. The topic proportions are attached to authors; in papers with multiple authors, each word is assumed to be attached to an author and drawn from a topic drawn from that author's topic proportions. The author-topic model allows for inferences about authors as well as
documents.
Many document collections are linked. For example, scientific papers are linked by
citations, or web pages are connected by hyperlinks. And several topic models have
been developed to account for those links when estimating the topics. The relational
topic model of [9] assumes that each document is modeled as in LDA and that the
links between documents depend on the distance between their topic proportions.
This is both a new topic model and a new network model. Unlike traditional statistical
models of networks, the relational topic model takes into account node attributes in
modeling the links.
Other work that incorporates meta-data into topic models includes models of
linguistic structure [8], models that account for distances between corpora [60], and
models of named entities [42]. General-purpose methods for incorporating meta-
data into topic models include Dirichlet-multinomial regression models [39] and
supervised topic models [37].

5.3.2.3 Acceleration

Many algorithms have been proposed to accelerate LDA based on the collapsed Gibbs sampling (CGS) update. In these existing fast algorithms, it is difficult to decouple the accesses to the document-topic counts C_d and the word-topic counts C_w because both counts need to be updated instantly after the sampling of every token. WarpLDA [13] is built on a new Monte Carlo Expectation Maximization (MCEM) algorithm, which is similar to CGS, except that both counts are fixed until the sampling of all tokens is finished. This scheme can be used to develop a reordering strategy that decouples the accesses to C_d and C_w and minimizes the size of randomly accessed memory.
Specifically, WarpLDA seeks a MAP solution of the latent variables Θ and Φ, with the latent topic assignments Z integrated out, i.e., it maximizes log P(Θ, Φ | W, α′, β′), where α′ and β′ are the Dirichlet hyperparameters. Reference [2] has shown that this MAP solution is almost identical to the solution of CGS with proper hyperparameters.

Computing log P(Θ, Φ | W, α′, β′) directly is expensive because it needs to enumerate all the K possible topic assignments for each token. We, therefore, optimize its lower bound as a surrogate. Let Q(Z) be a variational distribution. Then, by Jensen's inequality, the lower bound J(Θ, Φ, Q(Z)) is:

log P(Θ, Φ | W, α′, β′) ≥ E_Q[log P(W, Z | Θ, Φ) − log Q(Z)] + log P(Θ | α′) + log P(Φ | β′) ≜ J(Θ, Φ, Q(Z)). (5.4)

An Expectation Maximization (EM) algorithm is implemented to find a local maximum of the posterior P(Θ, Φ | W, α′, β′), where the E-step maximizes J with respect to the variational distribution Q(Z) and the M-step maximizes J with respect to the model parameters (Θ, Φ), while keeping Q(Z) fixed. One can prove that the optimal solution at the E-step is Q(Z) = P(Z | W, Θ, Φ) without further assumption on Q. We apply Monte Carlo approximation to the expectation in Eq. 5.4,

E_Q[log P(W, Z | Θ, Φ) − log Q(Z)] ≈ \frac{1}{S} \sum_{s=1}^{S} [log P(W, Z^{(s)} | Θ, Φ) − log Q(Z^{(s)})], (5.5)

where Z^{(1)}, ..., Z^{(S)} ∼ Q(Z) = P(Z | W, Θ, Φ). The sample size is set as S = 1 and the model uses Z as an abbreviation of Z^{(1)}.
Sampling Z: Each dimension of Z can be sampled independently:

Q(z_{d,n} = k) ∝ P(W, Z | Θ, Φ) ∝ θ_{dk} φ_{w_{d,n},k}. (5.6)

Optimizing Θ, Φ: With the Monte Carlo approximation, we have

J ≈ log P(W, Z | Θ, Φ) + log P(Θ | α′) + log P(Φ | β′) + const.
  = \sum_{d,k} (C_{dk} + α′_k − 1) log θ_{dk} + \sum_{k,w} (C_{kw} + β′ − 1) log φ_{kw} + const., (5.7)

and with the optimal solutions, we have

θ̂_{dk} ∝ C_{dk} + α′_k − 1,    φ̂_{wk} = \frac{C_{wk} + β′ − 1}{C_k + β̄′ − V}. (5.8)

Instead of computing and storing Θ̂ and Φ̂, we compute and store C_d and C_w to save memory because the latter are sparse. Plugging Eq. 5.8 into Eq. 5.6 and letting α = α′ − 1, β = β′ − 1, we get the full MCEM algorithm, which iteratively performs the following two steps until a given iteration number is reached:
100 5 Document Representation

• E-step: We can sample z_{d,n} ∼ Q(z_{d,n} = k) according to

Q(z_{d,n} = k) ∝ (C_{dk} + α_k) \frac{C_{wk} + β_w}{C_k + β̄}. (5.9)

• M-step: Compute C_d and C_w from Z.


Note that the resemblance of Eq. 5.9 to the CGS update intuitively justifies why MCEM leads to results similar to CGS. The difference between MCEM and CGS is that MCEM updates the counts C_d and C_w after sampling all z_{d,n}s, while CGS updates the counts instantly after sampling each z_{d,n}. The strategy of updating the counts only after sampling all z_{d,n}s is called delayed count update, or simply delayed update. MCEM can thus be viewed as CGS with a delayed update, which has been widely used in other algorithms [1, 41]. While previous work uses the delayed update as a trick, here a theoretical guarantee of convergence to a MAP solution is provided. The delayed update is essential for decoupling the accesses to C_d and C_w to improve cache locality, without affecting correctness.
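For contrast with CGS, here is a minimal sketch of the MCEM loop with delayed count updates, reusing the toy setup of the Gibbs sketch above. It only illustrates the delayed update schedule and makes no attempt to reproduce WarpLDA's cache-efficient reordering of memory accesses.

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]   # toy corpus: lists of word ids
K, V = 2, 5
alpha, beta = 0.5, 0.1

z = [[rng.integers(K) for _ in doc] for doc in docs]

def build_counts():
    """Recompute Cd, Cw, and the per-topic totals from the current assignments Z."""
    Cd = np.zeros((len(docs), K)); Cw = np.zeros((K, V)); Ck = np.zeros(K)
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            Cd[d, k] += 1; Cw[k, w] += 1; Ck[k] += 1
    return Cd, Cw, Ck

for _ in range(100):
    # M-step: rebuild the counts once per iteration (delayed update).
    Cd, Cw, Ck = build_counts()
    # E-step: sample every z_{d,n} from Eq. 5.9 with the counts held fixed.
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            p = (Cd[d] + alpha) * (Cw[:, w] + beta) / (Ck + V * beta)
            z[d][n] = rng.choice(K, p=p / p.sum())

Cd, Cw, Ck = build_counts()
print(Cw / Cw.sum(axis=1, keepdims=True))
```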

5.4 Distributed Document Representation

To address the disadvantages of bag-of-words document representation, [31] proposes paragraph vector models, including the version with Distributed Memory
(PV-DM) and the version with Distributed Bag-of-Words (PV-DBOW). Moreover,
researchers also proposed several hierarchical neural network models to represent
documents. In this section, we will introduce these models in detail.

5.4.1 Paragraph Vector

As shown in Fig. 5.2, paragraph vector maps every paragraph to a unique vector,
represented by a column in the matrix P and maps every word to a unique vector,
represented by a column in word embedding matrix E. The paragraph vector and
word vectors are averaged or concatenated to predict the next word in a context. More formally, compared to the word vector framework, the only change in this model is in the following equation:

y = Softmax(h(w_{t−k}, ..., w_{t+k}; E, P)), (5.10)

where h is constructed by the concatenation or average of the vectors extracted from E and P.

Fig. 5.2 The architecture of PV-DM model

More specifically, given a sequence of training words w_1, w_2, ..., w_l, the objective of the paragraph vector model is to maximize the average log probability:

O = \frac{1}{l} \sum_{i=k}^{l−k} log P(w_i | w_{i−k}, ..., w_{i+k}). (5.11)

The prediction task is typically done via a multi-class classifier, such as softmax. Thus, the probability is

P(w_i | w_{i−k}, ..., w_{i+k}) = \frac{e^{y_{w_i}}}{\sum_j e^{y_j}}. (5.12)

The paragraph token can be thought of as another word. It acts as a memory that
remembers what is missing from the current context, or the topic of the paragraph. For
this reason, this model is often called the Distributed Memory Model of Paragraph
Vectors (PV-DM).
The above method considers the concatenation of the paragraph vector with the
word vectors to predict the next word in a text window. Another way is to ignore the
context words in the input, but force the model to predict words randomly sampled
from the paragraph in the output. In reality, what this means is that at each iteration
of stochastic gradient descent, we sample a text window, then sample a random word
from the text window and form a classification task given the Paragraph Vector. This
technique is shown in Fig. 5.3. This version is named the Distributed Bag-of-Words
version of Paragraph Vector (PV-DBOW), as opposed to the Distributed Memory
version of Paragraph Vector (PV-DM) in the previous section.

Fig. 5.3 The architecture of PV-DBOW model

In addition to being conceptually simple, this model requires storing less data: only the softmax weights need to be stored, as opposed to both the softmax weights and the word vectors in the previous model. This model is also similar to the Skip-gram model for word vectors.
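In practice, paragraph vectors are usually trained with an off-the-shelf implementation such as gensim's Doc2Vec; the sketch below assumes gensim 4.x, and the toy corpus and hyperparameter values are arbitrary. The dm flag switches between the two variants described above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "paragraph vectors learn fixed length document embeddings",
]
corpus = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(raw_docs)]

# dm=1 trains PV-DM (paragraph vector + context words predict the target word);
# dm=0 trains PV-DBOW (paragraph vector alone predicts words sampled from the paragraph).
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, dm=1, epochs=100)

print(model.dv[0][:5])                                    # embedding of training document 0
print(model.infer_vector("a cat and a dog".split())[:5])  # inferred vector for an unseen document
```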

5.4.2 Neural Document Representation

In this part, we introduce two main kinds of neural networks for document repre-
sentation including document-context language model and hierarchical document
autoencoder.

5.4.2.1 Document-Context Language Model

Recurrent architectures can be used to combine local and global information in doc-
ument language modeling. The simplest such model would be to train a single RNN,
ignoring sentence boundaries as mentioned above; the last hidden state from the pre-
vious sentence t − 1 is used to initialize the first hidden state in sentence t. In such
an architecture, the length of the RNN is equal to the number of tokens in the docu-
ment; in typical genres such as news texts, this means training RNNs from sequences
of several hundred tokens, which introduces two problems: (1) Information decay: in a sentence with thirty tokens (not unusual in news text), the contextual information from the previous sentence must be propagated through the recurrent dynamics thirty times before it can reach the last token of the current sentence; meaningful document-level information is unlikely to survive such a long pipeline. (2) Learning: it is notoriously difficult to train recurrent architectures that involve many time steps.
In the case of an RNN trained on an entire document, back-propagation would have to run over hundreds of steps, posing severe numerical challenges.
To address these two issues, [28] proposes to use multilevel recurrent structures to represent documents, thereby efficiently leveraging document-level context in language modeling. They first propose the Context-to-Context Document-Context Language Model (ccDCLM), which assumes that contextual information from previous sentences needs to be able to "short-circuit" the standard RNN, so as to more directly impact the generation of words across longer spans of text. Formally, we have
ct−1 = ht−1,l , (5.13)

where l is the length of sentence t − 1. The ccDCLM model then creates additional paths for this information to impact each hidden representation in the current sentence t. Writing x_{t,n} for the word representation of the nth word in the tth sentence, we have

h_{t,n} = g_θ(h_{t,n−1}, f(x_{t,n}, c_{t−1})), (5.14)

where g_θ(·) is the activation function parameterized by θ and f(·) is a function that combines the context vector with the input x_{t,n} for the hidden state. Here we simply concatenate the representations,

f(x_{t,n}, c_{t−1}) = [x_{t,n}; c_{t−1}]. (5.15)

The emission probability for yt,n is then computed from ht,n as in the standard
RNNLM. The underlying assumption of this model is that contextual information
should impact the generation of each word in the current sentence. The model,
therefore, introduces computational “short-circuits” for cross-sentence information,
as illustrated in Fig. 5.4.

Fig. 5.4 The architecture of ccDCLM model
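A minimal PyTorch sketch of the ccDCLM idea is given below: the final hidden state of the previous sentence is concatenated with every word embedding of the current sentence, following Eqs. 5.13-5.15. Module names and sizes are illustrative and not those of the original implementation.

```python
import torch
import torch.nn as nn

class CCDCLM(nn.Module):
    """Context-to-context DCLM: the previous sentence's context enters every step's input."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Input is [word embedding ; context vector], hence emb_dim + hidden_dim.
        self.rnn = nn.LSTM(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sent, context):
        # sent: (batch, seq_len) token ids; context: (batch, hidden_dim), i.e., c_{t-1}.
        x = self.embed(sent)                                  # (batch, seq_len, emb_dim)
        ctx = context.unsqueeze(1).expand(-1, x.size(1), -1)  # repeat c_{t-1} at every position
        h, (h_n, _) = self.rnn(torch.cat([x, ctx], dim=-1))   # Eq. 5.14 with f = concatenation
        logits = self.out(h)                                  # per-step emission scores
        return logits, h_n[-1]                                # h_n[-1] becomes c_t for the next sentence

model = CCDCLM(vocab_size=1000)
context = torch.zeros(2, 128)   # empty context before the first sentence
for sent in [torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5))]:
    logits, context = model(sent, context)
print(logits.shape)
```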



Fig. 5.5 The architecture of coDCLM model

Besides, they also proposed the Context-to-Output Document-Context Language Model (coDCLM). Rather than incorporating the document context into the recurrent
definition of the hidden state, the coDCLM model pushes it directly to the output, as
illustrated in Fig. 5.5. Let ht,n be the hidden state from a conventional RNNLM of
sentence t,
ht,n = gθ (ht,n−1 , xt,n ). (5.16)

Then, the context vector ct−1 is directly used in the output layer as

yt,n ∼ Softmax(Wh ht,n + Wc ct−1 + b). (5.17)

5.4.2.2 Hierarchical Document Autoencoder

Reference [33] also proposes a hierarchical document autoencoder to represent documents. The model draws on the intuition that just as the juxtaposition of words
creates a joint meaning of a sentence, the juxtaposition of sentences also creates a
joint meaning of a paragraph or a document.
They first obtain representation vectors at the sentence level by putting one layer of LSTM (denoted as LSTM^{word}_{encode}) on top of its containing words:

h^w_t(enc) = LSTM^{word}_{encode}(w_t, h^w_{t−1}(enc)). (5.18)

The vector output at the ending time step is used to represent the entire sentence as

s = h^w_{end_s}. (5.19)

To build the representation e_D for the current document/paragraph, another layer of LSTM (denoted as LSTM^{sentence}_{encode}) is placed on top of all sentences, computing representations sequentially for each time step:

Fig. 5.6 The architecture of hierarchical document autoencoder

h^s_t(enc) = LSTM^{sentence}_{encode}(s_t, h^s_{t−1}(enc)). (5.20)

The representation h^s_{end_D} computed at the final time step is used to represent the entire document: d = h^s_{end_D}.
Thus one LSTM operates at the token level, leading to the acquisition of sentence-
level representations that are then used as inputs into the second LSTM that acquires
document-level representations, in a hierarchical structure.
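The encoding side of this hierarchy can be sketched in PyTorch as follows; dimensions, batch handling, and module names are illustrative assumptions, and the decoder is omitted.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word-level LSTM builds sentence vectors; a sentence-level LSTM builds the document vector."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)     # LSTM^word_encode
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # LSTM^sentence_encode

    def forward(self, doc):
        # doc: list of (1, sent_len) token-id tensors, one tensor per sentence.
        sent_vecs = []
        for sent in doc:
            _, (h_n, _) = self.word_lstm(self.embed(sent))  # Eq. 5.18
            sent_vecs.append(h_n[-1])                       # Eq. 5.19: final state = sentence vector
        sents = torch.stack(sent_vecs, dim=1)               # (1, num_sents, hidden_dim)
        _, (h_n, _) = self.sent_lstm(sents)                 # Eq. 5.20
        return h_n[-1]                                      # document vector e_D

encoder = HierarchicalEncoder(vocab_size=1000)
doc = [torch.randint(0, 1000, (1, n)) for n in (6, 9, 4)]
print(encoder(doc).shape)   # torch.Size([1, 128])
```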
As with encoding, the decoding algorithm operates on a hierarchical structure with two layers of LSTMs. LSTM outputs at the sentence level for time step t are obtained by

h^s_t(dec) = LSTM^{sentence}_{decode}(s_t, h^s_{t−1}(dec)). (5.21)

At the initial time step, h^s_0(dec) = e_D, the end-to-end output from the encoding procedure. h^s_t(dec) is used as the original input into LSTM^{word}_{decode} for subsequently predicting tokens within sentence t + 1. LSTM^{word}_{decode} predicts tokens at each position sequentially, the embedding of which is then combined with earlier hidden vectors for the next time-step prediction until the end_s token is predicted. The procedure can be summarized as follows:

h^w_t(dec) = LSTM^{word}_{decode}(w_t, h^w_{t−1}(dec)), (5.22)

P(w|·) = Softmax(w, h^w_{t−1}(dec)). (5.23)



Fig. 5.7 The architecture of hierarchical document autoencoder with attentions

During decoding, LSTM^{word}_{decode} generates each word token w sequentially and combines it with earlier LSTM-outputted hidden vectors. The LSTM hidden vector computed at the final time step is used to represent the current sentence. This is passed to LSTM^{sentence}_{decode}, combined with h^s_t for the acquisition of h^s_{t+1}, and outputted to the next time step in sentence decoding. For each time step t, LSTM^{sentence}_{decode} has to first decide whether decoding should proceed or come to a full stop: we add an additional token end_D to the vocabulary. Decoding terminates when the token end_D is predicted. Details are shown in Fig. 5.6.
Attention models adopt a look-back strategy by linking the current decoding
stage with input sentences in an attempt to consider which part of the input is most
responsible for the current decoding state (Fig. 5.7).
Let H = {h^s_1(enc), h^s_2(enc), ..., h^s_N(enc)} be the collection of sentence-level hidden vectors for each sentence from the inputs, outputted from LSTM^{sentence}_{encode}. Each element in H contains information about input sequences with a strong focus on the parts surrounding each specific sentence (time step). During decoding, suppose that e^s_t denotes the sentence-level embedding at the current step and that h^s_{t−1}(dec) denotes the hidden vector outputted from LSTM^{sentence}_{decode} at the previous time step t − 1. Attention models would first link the current-step decoding information, i.e., h^s_{t−1}(dec) outputted from LSTM^{sentence}_{decode}, with each of the input sentences i ∈ [1, N], characterized by a strength indicator v_i:

v_i = U^⊤ f(W_1 · h^s_{t−1}(dec) + W_2 · h^s_i(enc)), (5.24)
where W_1, W_2 ∈ R^{K×K}, U ∈ R^{K×1}. v_i is then normalized:

α_i = \frac{exp(v_i)}{\sum_j exp(v_j)}. (5.25)

The attention vector is then created as a weighted average over all input sentences:

m_t = \sum_{i=1}^{N_D} α_i h^s_i(enc). (5.26)
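The attention of Eqs. 5.24-5.26 amounts to a few tensor operations. The sketch below uses tanh for f and random tensors in place of trained encoder and decoder states.

```python
import torch

K, N = 128, 4                                   # hidden size, number of input sentences
W1, W2 = torch.randn(K, K), torch.randn(K, K)   # trainable matrices in a real model
U = torch.randn(K, 1)

h_dec_prev = torch.randn(K)    # h^s_{t-1}(dec), the previous sentence-level decoder state
h_enc = torch.randn(N, K)      # the N sentence-level encoder states h^s_i(enc)

# Eq. 5.24: strength indicator for every input sentence (f chosen as tanh here).
v = (torch.tanh(h_dec_prev @ W1.T + h_enc @ W2.T) @ U).squeeze(-1)   # (N,)
# Eq. 5.25: normalize the strengths into attention weights.
alpha = torch.softmax(v, dim=0)
# Eq. 5.26: attention vector as the weighted sum of encoder states.
m_t = (alpha.unsqueeze(-1) * h_enc).sum(dim=0)
print(m_t.shape)   # torch.Size([128])
```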

5.5 Applications

In this section, we will introduce several applications of document-level analysis based on representation learning.

5.5.1 Neural Information Retrieval

Information retrieval aims to obtain relevant resources from a large-scale collection of information resources. As shown in Fig. 5.8, given the query "Steve Jobs" as input, the search engine (a typical application of information retrieval) provides relevant web pages for users. Traditional information retrieval data consists of search queries and document collections D, and the ground truth is available through explicit human judgments or implicit user behavior data such as click-through rates.

Fig. 5.8 An example of information retrieval



For a given query q and document d, traditional information retrieval models estimate their relevance through lexical matches. Neural information retrieval models pay more attention to capturing query-document relevance from semantic matches. Both lexical and semantic matches are essential for neural information retrieval. Benefiting from the representation power of neural networks, neural models can capture more sophisticated matching features and have achieved the state of the art in the information retrieval task [17].
Current neural ranking models can be categorized into two groups: representation-
based and interaction-based [23]. The earlier works mainly focus on representation-
based models. They learn good representations and match them in the learned repre-
sentation space of queries and documents. Interaction-based methods, on the other
hand, model the query-document matches from the interactions of their terms.

5.5.1.1 Representation-Based Neural Ranking Models

The representation-based methods directly match the query and documents by learn-
ing two distributed representations, respectively, and then compute the matching
score based on the similarity between them. In recent years, several deep neural
models have been explored based on such Siamese architecture, which can be done
by feedforward layers, convolutional neural networks, or recurrent neural networks.
Reference [26] proposes Deep Structured Semantic Models (DSSM), which first hash words into letter-trigram-based representations and then use a multilayer fully connected neural network to encode a query (or a document) as a vector. The relevance between the query and the document can then be simply calculated with the cosine similarity. Reference [26] trains the model by minimizing the cross-entropy loss on click-through data, where each training sample consists of a query q, a positive document d^+, and a uniformly sampled negative document set D^−:

L_{DSSM}(q, d^+, D^−) = − log \frac{e^{r · cos(q, d^+)}}{\sum_{d ∈ D} e^{r · cos(q, d)}}, (5.27)

where D = {d^+} ∪ D^−.
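The DSSM objective of Eq. 5.27 reduces to a softmax cross-entropy over cosine similarities. The sketch below is a minimal illustration with random vectors standing in for the encoded query and documents and an assumed value for the smoothing factor r.

```python
import torch
import torch.nn.functional as F

r = 10.0                       # softmax smoothing factor (assumed value)
q = torch.randn(128)           # encoded query
d_pos = torch.randn(128)       # encoded clicked (positive) document
d_negs = torch.randn(4, 128)   # encoded sampled negative documents

docs = torch.cat([d_pos.unsqueeze(0), d_negs], dim=0)      # D = {d+} ∪ D-
cos = F.cosine_similarity(q.unsqueeze(0), docs, dim=-1)    # cos(q, d) for every d in D

# Eq. 5.27: negative log probability assigned to the clicked document.
loss = -torch.log_softmax(r * cos, dim=0)[0]
print(loss.item())
```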
Furthermore, CDSSM [54] and ARC-I [25] utilize convolutional neural network
(CNN), while LSTM-RNN [44] adopts recurrent neural network with Long Short-
Term Memory (LSTM) units to represent a sentence better. Reference [53] also comes
up with a more sophisticated similarity function by leveraging additional layers of
the neural network.

5.5.1.2 Interaction-Based Neural Ranking Models

The interaction-based neural ranking models learn word-level interaction patterns from query-document pairs, as shown in Fig. 5.9, and they provide an opportunity to compare different parts of the query with different parts of the document individually and aggregate the partial evidence of relevance.

Fig. 5.9 The architecture of interaction-based neural ranking models

ARC-II [25] and MatchPyramid [45] utilize convolutional neural networks to capture complicated patterns from word-level interactions. The Deep Relevance Matching Model (DRMM) uses pyramid pooling (histogram) to summarize the word-level similarities into ranking models [23]. There are also some works establishing position-dependent interactions for ranking models [27, 46].
Kernel-based Neural Ranking Model (K-NRM) [66] and its convolutional version
Conv-KNRM [17] achieve the state of the art in neural information retrieval. K-NRM
first establishes a translation matrix M in which each element M_{ij} is the cosine similarity of the ith word in q and the jth word in d. Then K-NRM utilizes kernels to convert the translation matrix M into ranking features φ(M):

φ(M) = \sum_{i=1}^{n} log K(M_i), (5.28)

K(M_i) = {K_1(M_i), ..., K_K(M_i)}. (5.29)

Each RBF kernel K_k calculates how word pair similarities are distributed:

K_k(M_i) = \sum_j exp\left(−\frac{(M_{ij} − μ_k)^2}{2σ_k^2}\right). (5.30)

Then the relevance of q and d is calculated by a ranking layer:

f(q, d) = tanh(w^⊤ φ(M) + b), (5.31)

where w and b are trainable parameters.



Reference [66] trains the model by minimizing a pair-wise loss on click-through data:

L = \sum_q \sum_{d^+, d^− ∈ D^{+,−}} max(0, 1 − f(q, d^+) + f(q, d^−)). (5.32)

For a given query q, D^{+,−} contains the pair-wise preferences from the ground truth; d^+ and d^− are two documents such that d^+ is more relevant to q than d^−. Conv-KNRM extends K-NRM to model n-gram semantic matches based on convolutional neural networks, which can leverage snippet information.
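Kernel pooling (Eqs. 5.28-5.31) is likewise a handful of tensor operations. In the sketch below the kernel means and widths and the random word embeddings are illustrative assumptions; the exact kernel settings used by K-NRM differ.

```python
import torch
import torch.nn.functional as F

n, m, dim, K = 5, 20, 64, 11        # query length, document length, embedding size, kernel count
mu = torch.linspace(-0.9, 1.0, K)   # kernel means (illustrative)
sigma = torch.full((K,), 0.1)       # kernel widths (illustrative)

q = F.normalize(torch.randn(n, dim), dim=-1)   # query word embeddings (unit length)
d = F.normalize(torch.randn(m, dim), dim=-1)   # document word embeddings (unit length)

M = q @ d.T   # translation matrix of cosine similarities, shape (n, m)
# Eq. 5.30: each kernel softly counts document words near its mean, per query word.
kernels = torch.exp(-(M.unsqueeze(-1) - mu) ** 2 / (2 * sigma ** 2)).sum(dim=1)   # (n, K)
# Eqs. 5.28-5.29: sum the log kernel values over query words to get ranking features.
phi = torch.log(kernels.clamp(min=1e-10)).sum(dim=0)   # (K,)
# Eq. 5.31: final ranking layer.
w, b = torch.randn(K), torch.zeros(1)
score = torch.tanh(w @ phi + b)
print(score.item())
```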

5.5.1.3 Summary

Representation-based models and interaction-based models extract match features from overall and local aspects, respectively. They can also be combined for further improvements [40].
Recently, large-scale knowledge graphs such as DBpedia, Yago, and Freebase
have emerged. Knowledge graphs contain human knowledge about real-world enti-
ties and become an opportunity for search systems to understand queries and doc-
uments better. The emergence of large-scale knowledge graphs has motivated the
development of entity-oriented search, which brings in entities and semantics from
the knowledge graphs and has dramatically improved the effectiveness of feature-
based search systems.
Entity-oriented search and neural ranking models push the boundary of match-
ing from two different perspectives. Reference [36] incorporates semantics from
knowledge graphs into the neural ranking, such as entity descriptions and entity
types. This work significantly improves the effectiveness and generalization ability
of interaction-based neural ranking models. However, how to fully leverage semi-
structured knowledge graphs and establish semantic relevance between queries and
documents remains an open question.
Information retrieval has been widely used in many natural language processing tasks such as reading comprehension and question answering. Therefore, there is no doubt that neural information retrieval will lead to new trends in these tasks.

5.5.2 Question Answering

Question Answering (QA) is one of the most important document-level applications in NLP. Many efforts have been invested in QA, especially in machine reading comprehension and open-domain QA. In this section, we will introduce the advances in these two tasks, respectively.

5.5.2.1 Machine Reading Comprehension

As shown in Fig. 5.10, machine reading comprehension aims to determine the answer a to the question q given a passage p. The task could be viewed as a supervised learning problem: given a collection of training examples {(p_i, q_i, a_i)}_{i=1}^{n}, we want to learn a mapping f(·) that takes the passage p_i and the corresponding question q_i as inputs and outputs â_i, where evaluate(â_i, a_i) is maximized. The evaluation metric is typically correlated with the answer type, which will be discussed in the following.
Generally, the current machine reading comprehension task could be divided into
four categories depending on the answer types according to [10], i.e., cloze style,
multiple choices, span prediction, and free-form answer.
The cloze style task such as CNN/Daily Mail [24] consists of fill-in-the-blank sentences where the question contains a placeholder to be filled in. The answer a is either chosen from a predefined candidate set A or from the vocabulary V. The
multiple-choice task such as RACE [30] and MCTest [50] aims to select the best
answer from a set of answer choices. It is typical to use accuracy to measure the
performance on these two tasks: the percentage of correctly answered questions in
the whole example set, since the question could be either correctly answered or not
from the given hypothesized answer set.

Fig. 5.10 An example of machine reading comprehension from SQuAD [49]



The span prediction task such as SQuAD [49] is perhaps the most widely adopted task among all, since it strikes a compromise between flexibility and simplicity. The task is to extract the most likely text span from the passage as the answer to the question,
which is usually modeled as predicting the start position idxstar t and end position
idxend of the answer span. To evaluate the predicted answer span â, we typically use
two evaluation metrics proposed by [49]. Exact match assigns full score 1.0 to the
predicted answer span â if it exactly equals the ground truth answer a, otherwise 0.0.
F1 score measures the degree of overlap between â and a by computing a harmonic
mean of the precision and recall.
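The two span-prediction metrics can be computed as follows. This is a simplified sketch with whitespace tokenization and no answer normalization (lowercasing, stripping punctuation and articles), which official evaluation scripts additionally perform.

```python
from collections import Counter

def exact_match(prediction, ground_truth):
    """1.0 if the predicted span equals the gold answer exactly, else 0.0."""
    return float(prediction == ground_truth)

def f1_score(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall between the two spans."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Denver Broncos", "Denver Broncos"))    # 1.0
print(f1_score("the Denver Broncos", "Denver Broncos"))   # 0.8
```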
The free-form answer task such as MS MARCO [43] does not restrict the answer
form or length and is also referred to as generative question answering. It is practical
to model the task as a sequence generation problem, where the discrete token-level
prediction was made. Currently, a consensus on what is the ideal evaluation metrics
has not been achieved. It is common to adopt standard metrics in machine translation
and summarization, including ROUGE [34] and BLEU [57].
As a critical component of question answering systems, the surging neural-based machine reading comprehension models have greatly boosted the task of question answering in the last decade.
The first attempt [24] to apply neural networks on machine reading comprehension
constructs bidirectional LSTM reader models along with attention mechanisms. The
work introduces two reader models, i.e., the attentive reader and the impatient reader,
as shown in Fig. 5.11. After encoding the passage and the query into hidden states
using LSTMs, the attentive reader computes a scalar distribution s(t) over the passage
tokens and uses it to compute the weighted sum of the passage hidden states r . The
impatient reader extends this idea further by recurrently updating the weighted sum
of passage hidden states after it has seen each query token.
The attention mechanisms used in reading comprehension could be viewed as
a variant of Memory Networks [64]. Memory Networks use long-term memory
units to store information for inference dynamically. Typically, given an input x,

Fig. 5.11 The architecture of the bidirectional LSTM reader models: (a) the attentive reader, (b) the impatient reader



the model first converts it into an internal feature representation F(x). Then, the
model can update the designated memory units m i given the new input: m i =
g(m i , F(x), m), or generate output features o given the new input and the mem-
ory states: o = f (F(x), m). Finally, the model converts the output into the response
with the desired format: r = R(o). The key takeaway of Memory Networks is the retaining and updating of internal memories that capture global information. We will see how this idea is further extended in some sophisticated models.
There is no doubt that the application of attention to machine reading comprehension greatly promotes research in this field. Following [24], the work [11] modifies the method for computing attention and simplifies the prediction layer of the attentive reader. Instead of using tanh(·) to compute the relevance between the passage representations {p̃_i}_{i=1}^{n} and the query hidden state q (see Eq. 5.33), Chen et al. use bilinear terms to directly capture the passage-query alignment (see Eq. 5.34):

α_i = Softmax_i(tanh(W_1 p̃_i + W_2 q)), (5.33)

α_i = Softmax_i(q^⊤ W_3 p̃_i). (5.34)

Most machine reading comprehension models follow the same paradigm to locate
the start and endpoint of the answer span. As shown in Fig. 5.12, while encoding the
passage, the model retains the length of the sequence and encodes the question into
a fixed-length hidden representation q. The question's hidden vector is then used as a pointer to scan over the passage representations {p_i}_{i=1}^{n} and compute scores
on every position in the passage. While maintaining this similar architecture, most
machine reading comprehension models vary in the interaction methods between the
passage and the question. In the following, we will introduce several classic reading
comprehension architectures that follow this paradigm.

Fig. 5.12 The architecture of classic machine reading comprehension models



First, we introduce BiDAF, which is short for Bi-Directional Attention Flow [52].
The BiDAF network consists of the token embedding layer, the contextual embedding
layer, the bi-directional attention flow layer, the LSTM modeling layer, and the
softmax output layer, as shown in Fig. 5.13.
The token embedding layer consists of two levels. First, the character embedding
layer encodes each word in character level by adopting a 1D convolutional neural
network (CNN). Specifically, for each word, characters are embedded into fixed-
length vectors, which are considered as 1D input for CNNs. The outputs are then
max-pooled along the embedding dimension to obtain a single fixed-length vector.
Second, the word embedding layer uses pretrained word vectors, i.e., GloVe [47], to
map each word into a high-dimensional vector directly.
Then the concatenation of the two vectors is fed into a two-layer Highway Network [56]. Equations 5.35 and 5.36 show one layer of the highway network used in the paper, where H_1(·) and H_2(·) represent two affine transformations:

g = Sigmoid(H_1(x)), (5.35)

y = g ⊙ ReLU(H_2(x)) + (1 − g) ⊙ x. (5.36)

Fig. 5.13 The architecture of BiDAF model



After feeding the context and the query to the token embedding layer, we obtain X ∈ R^{d×T} for the context and Q ∈ R^{d×J} for the query, respectively. Afterward, the contextual embedding layer, which is a bidirectional LSTM, models the temporal interaction between words for both the context and the query.
Then comes the attention flow layer. In this layer, the attention dependency is computed in both directions, i.e., the context-to-query (C2Q) attention and the
query-to-context (Q2C) attention. For both kinds of attention, we first compute a
similarity matrix S ∈ RT ×J using the contextual embeddings of the context H and
the query U obtained from the last layer (Eq. 5.37). In the equation, α(·) computes
the scalar similarity of the given two vectors and m is a trainable weight vector.

S_{tj} = α(H_{:,t}, U_{:,j}), (5.37)

α(h, u) = m^⊤[h; u; h ⊙ u], (5.38)

where ⊙ indicates the element-wise product.


For the C2Q attention, a weighted sum of contextual query embeddings is computed for each context word. The attention distribution over the query is obtained by a_t = Softmax(S_{t,:}) ∈ R^J. The final attended query vector is therefore Ũ_{:,t} = \sum_j a_{tj} U_{:,j} for each context word.
For the Q2C attention, the context embeddings are merged into a single fixed-length hidden vector h̃. The attention distribution over the context is computed by b_t = Softmax(max_j S_{tj}), and h̃ = \sum_t b_t H_{:,t}. Lastly, the merged context embedding is tiled T times along the column to produce H̃.
Finally, the attended outputs are combined to yield G, which is defined by Eqs. 5.39 and 5.40:

G_{:,t} = β(H_{:,t}, Ũ_{:,t}, H̃_{:,t}), (5.39)

β(h, ũ, h̃) = [h; ũ; h ⊙ ũ; h ⊙ h̃]. (5.40)

Afterward, the LSTM modeling layer takes G as input and encodes it using a
two-layer bidirectional LSTM. The output M ∈ R2d×T is combined with G to yield
the final start and end probability distributions over the passage.

P^1 = Softmax(u_1^⊤ [G; M]), (5.41)

P^2 = Softmax(u_2^⊤ [G; LSTM(M)]), (5.42)

where u_1, u_2 are two trainable weight vectors.


To train the model, the negative log likelihood loss is adopted and the goal is
to maximize the probability of the golden start index idxstar t and end index idxend
being selected by the model,

L = −\frac{1}{N} \sum_{i=1}^{N} \left[ log(P^1_{idx^i_{start}}) + log(P^2_{idx^i_{end}}) \right]. (5.43)
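The bi-directional attention flow of Eqs. 5.37-5.40 can be sketched as follows, with random tensors standing in for the contextual embeddings H and U; for simplicity a single dimensionality d is used in place of the 2d-dimensional contextual vectors.

```python
import torch

d, T, J = 64, 30, 8          # contextual dimension, context length, query length
H = torch.randn(d, T)        # contextual embeddings of the context
U = torch.randn(d, J)        # contextual embeddings of the query
m = torch.randn(3 * d)       # trainable weight vector of alpha(h, u)

# Eqs. 5.37-5.38: similarity matrix with S_tj = m^T [h; u; h ⊙ u].
h = H.unsqueeze(2).expand(d, T, J)
u = U.unsqueeze(1).expand(d, T, J)
S = (torch.cat([h, u, h * u], dim=0) * m.view(-1, 1, 1)).sum(dim=0)   # (T, J)

# C2Q attention: attended query vector for every context word.
a = torch.softmax(S, dim=1)   # (T, J)
U_tilde = U @ a.T             # (d, T)

# Q2C attention: attend over context words, then tile the merged vector T times.
b = torch.softmax(S.max(dim=1).values, dim=0)   # (T,)
h_tilde = (H * b).sum(dim=1, keepdim=True)      # (d, 1)
H_tilde = h_tilde.expand(d, T)

# Eqs. 5.39-5.40: merged query-aware context representation G.
G = torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=0)   # (4d, T)
print(G.shape)
```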

Besides BiDAF, where attention dependencies are computed in two directions, we will also briefly introduce other interaction methods between the query and the passage. The Gated-Attention Reader proposed by [19] adopts a gated attention module, where each token representation of the passage d_i is scaled by the attended query vector after each Bi-GRU layer (Eqs. 5.44-5.46):

α_i = Softmax(Q^⊤ d_i), (5.44)

q̃_i = Q α_i, (5.45)

x_i = d_i ⊙ q̃_i. (5.46)

This gated attention mechanism allows the query to directly interact with the token
embeddings of the passage at the semantic level. And such layer-wise interaction
enables the model to learn conditional token representation given the question at
different representation levels.
The Attention-over-Attention Reader [16] takes another path to model the inter-
action. The attention-over-attention mechanism involves calculating the attention
between the passage attention α(t) and the averaged question attention β after obtain-
ing the similarity matrix M ∈ Rn×m (Eq. 5.47). This operation is considered to learn
the contributions of individual question words explicitly.

α(t) = Softmax(M_{:,t}),    β = \frac{1}{N} \sum_{t=1}^{N} Softmax(M_{t,:}). (5.47)

5.5.2.2 Open-Domain Question Answering

Open-domain QA (OpenQA) has been first proposed by [21]. The task aims to
answer open-domain questions using external resources such as collections of docu-
ments [58], web pages [14, 29], structured knowledge graphs [3, 7] or automatically
extracted relational triples [20].
Recently, with the development of machine reading comprehension techniques
[11, 16, 19, 55, 63], researchers attempt to answer open-domain questions via per-
forming reading comprehension on plain texts. Reference [12] proposes to employ
neural-based models to answer open-domain questions. As illustrated in Fig. 5.14,
neural-based OpenQA system usually retrieves relevant texts of the question from a
large-scale corpus and then extracts answers from these texts using reading compre-
hension models.

Fig. 5.14 An example of open-domain question answering: given the question "Who is the CEO of Apple in 2020?", a document retriever first finds relevant passages and a document reader then extracts the answer "Tim Cook"

The DrQA system consists of two components: (1) the document retriever module for finding relevant articles and (2) the document reader model for extracting answers from the given contexts.
The document retriever is used as a first quick skim to narrow the search space and focus on documents that are likely to be relevant. The retriever builds TF-IDF weighted bag-of-words vectors for the documents and the questions and computes similarity scores for ranking. To further utilize local word order information, the retriever uses hashed bigram counts, which preserves both speed and memory efficiency.
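The sketch below shows one way to build such a hashed unigram/bigram TF-IDF retriever with scikit-learn; the feature dimensionality, the cosine-style scoring, and the class interface are illustrative assumptions rather than DrQA's exact configuration.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel

class TfidfRetriever:
    def __init__(self, docs, n_features=2**20):
        # Unigrams and bigrams hashed into a fixed-size feature space.
        self.vectorizer = HashingVectorizer(ngram_range=(1, 2),
                                            n_features=n_features,
                                            alternate_sign=False)
        self.tfidf = TfidfTransformer()
        self.doc_vecs = self.tfidf.fit_transform(self.vectorizer.transform(docs))
        self.docs = docs

    def retrieve(self, question, k=5):
        # Score the question against every document and return the top-k.
        q_vec = self.tfidf.transform(self.vectorizer.transform([question]))
        scores = linear_kernel(q_vec, self.doc_vecs).ravel()
        top = scores.argsort()[::-1][:k]
        return [(self.docs[i], float(scores[i])) for i in top]
```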
The document reader model takes in the top 5 Wikipedia articles yielded by the
document retriever and extracts the final answer to the question. For each article, the
document reader predicts an answer span with a confidence score. The final prediction
is made by maximizing the unnormalized exponential of prediction scores across the
documents.
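One plausible reading of this aggregation step is sketched below; it assumes each document's reader has already produced its best span together with the unnormalized exponentials of the span-boundary scores, which compresses DrQA's decoding into a single comparison.

```python
def aggregate_answers(candidates):
    """Pick the final answer across documents by the unnormalized span score.

    candidates: list of (answer_text, exp_start_score, exp_end_score) tuples,
    one per retrieved document, where the scores are the unnormalized
    exponentials of the start/end boundary scores (before per-document softmax).
    """
    return max(candidates, key=lambda c: c[1] * c[2])[0]
```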
Given each document d, the document reader first builds a feature representation d̃_i for each word in the document. The feature representation d̃_i is made up of the following components.

1. Word embeddings: The word embeddings f_emb(d_i) are obtained from large-scale GloVe embeddings pretrained on Wikipedia.
2. Manual features: The manual features f_token(d_i) combine part-of-speech (POS) tags, named entity recognition tags, and normalized term frequency (TF).
3. Exact match: This feature indicates whether di can be exactly matched to one
question word in q.
4. Aligned question embeddings: This feature aims to encode a soft alignment
between words in the document and the question in the word embedding space.
f_align(d_i) = Σ_j α_{ij} E(q_j),                            (5.48)

α_{ij} = exp(MLP(E(d_i))^⊤ MLP(E(q_j))) / Σ_{j'} exp(MLP(E(d_i))^⊤ MLP(E(q_{j'}))),    (5.49)

where MLP(x) = max(0, Wx + b) and E(q_j) indicates the word embedding of the jth word in the question.

Finally, the feature representation is obtained by concatenating the above features:

d̃_i = (f_emb(d_i), f_token(d_i), f_exact_match(d_i), f_align(d_i)).    (5.50)
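To make Eqs. 5.48-5.50 concrete, the sketch below computes the aligned-question feature and concatenates the four feature groups; the shared projection layer and the assumption that the manual and exact-match features arrive precomputed are illustrative choices, not part of the original description.

```python
import torch
import torch.nn.functional as F

def aligned_question_feature(E_d, E_q, proj):
    """Soft alignment feature (Eqs. 5.48-5.49).

    E_d: (n, e) word embeddings of the document tokens.
    E_q: (m, e) word embeddings of the question tokens.
    proj: a shared linear layer, so that MLP(x) = max(0, Wx + b).
    """
    Pd = F.relu(proj(E_d))                  # MLP(E(d_i)) for every document token
    Pq = F.relu(proj(E_q))                  # MLP(E(q_j)) for every question token
    alpha = F.softmax(Pd @ Pq.t(), dim=1)   # Eq. (5.49): normalized over question words
    return alpha @ E_q                      # Eq. (5.48): (n, e) aligned embeddings

def build_token_features(E_d, E_q, f_token, f_exact, proj):
    """Concatenate the feature groups of Eq. (5.50) for every document token."""
    f_align = aligned_question_feature(E_d, E_q, proj)
    return torch.cat([E_d, f_token, f_exact, f_align], dim=1)
```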

Then the feature representation of the document is fed into a multilayer bidirectional LSTM (BiLSTM) to encode the contextual representation:

d_1, . . . , d_n = BiLSTM(d̃_1, . . . , d̃_n).                (5.51)

For the question, the contextual representation is simply obtained by encoding the word embeddings using a multilayer BiLSTM:

q_1, . . . , q_m = BiLSTM(q̃_1, . . . , q̃_m).                (5.52)

After that, the contextual representation is aggregated into a fixed-length vector using self-attention:

b_j = exp(u^⊤ q_j) / Σ_{j'} exp(u^⊤ q_{j'}),                 (5.53)
q = Σ_j b_j q_j.                                             (5.54)
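A minimal sketch of this self-attentive pooling step, where u is assumed to be a learned scoring vector passed in as a tensor:

```python
import torch.nn.functional as F

def self_attentive_pool(Q_ctx, u):
    """Aggregate question token encodings into one vector (Eqs. 5.53-5.54).

    Q_ctx: (m, h) contextual encodings q_j of the question tokens.
    u: (h,) learned scoring vector.
    """
    b = F.softmax(Q_ctx @ u, dim=0)   # Eq. (5.53): one weight per question token
    return b @ Q_ctx                  # Eq. (5.54): weighted sum, shape (h,)
```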

In the answer prediction phase, the start and end probability distributions are calculated following the paradigm mentioned in the Reading Comprehension Model section (Sect. [Link]):

P_start(i) = exp(d_i^⊤ W_start q) / Σ_{i'} exp(d_{i'}^⊤ W_start q),    (5.55)
P_end(i) = exp(d_i^⊤ W_end q) / Σ_{i'} exp(d_{i'}^⊤ W_end q).          (5.56)
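This bilinear scoring can be written compactly as below; the function handles one document at a time, and W_start, W_end are assumed to be trainable matrices.

```python
import torch.nn.functional as F

def span_distributions(D_ctx, q, W_start, W_end):
    """Bilinear start/end distributions over document tokens (Eqs. 5.55-5.56).

    D_ctx: (n, h) contextual encodings d_i of the document tokens.
    q: (h,) aggregated question vector.
    W_start, W_end: (h, h) trainable bilinear matrices.
    """
    p_start = F.softmax(D_ctx @ (W_start @ q), dim=0)  # Eq. (5.55)
    p_end = F.softmax(D_ctx @ (W_end @ q), dim=0)      # Eq. (5.56)
    return p_start, p_end
```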

Despite its success, the DrQA system is prone to noise in the retrieved texts, which may hurt the performance of the system. Hence, [15] and [61] attempt to solve the noise problem in DrQA by separating question answering into paragraph selection and answer extraction, and both select only the most relevant paragraph among all retrieved paragraphs to extract answers. As a result, they lose a large amount of rich information contained in the neglected paragraphs. Hence, [62] proposes strength-based and coverage-based re-ranking approaches, which aggregate the results extracted from each paragraph by an existing distantly supervised QA (DS-QA) system to determine the answer better. However, this method relies on the pre-extracted answers of existing DS-QA models and still suffers from the noise in distant supervision data because it considers all retrieved paragraphs indiscriminately. To address this issue, [35] proposes a coarse-to-fine denoising OpenQA model, which employs a paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer from the denoised paragraphs.

5.6 Summary

In this chapter, we have introduced document representation learning, which encodes the semantic information of the whole document into a real-valued representation vector, providing downstream tasks with an effective way to utilize document information and significantly improving their performance.
First, we introduce the one-hot representation for documents. Next, we extensively introduce topic models that represent both words and documents using latent topic distributions. Further, we give an introduction to distributed document representation, including the paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.
In the future, for better document representation, several directions require further effort:
(1) Incorporating External Knowledge. Current document representation approaches focus on representing documents with the semantic information of the document text itself. Meanwhile, knowledge bases provide external semantic information for better understanding the real-world entities in a given document. Researchers have reached a consensus that incorporating the entity semantics of knowledge bases is a promising way toward better document representation. Some existing work leverages various kinds of entity semantics to enhance document representations and achieves better performance in multiple applications such as document ranking [36, 65]. Explicitly modeling structural and textual semantic information, as well as considering entity importance for the given document, also sheds light on more interpretable and knowledgeable document representations for downstream NLP tasks.
(2) Considering Document Interactions. The candidate documents in downstream NLP tasks are usually related to each other, which may help in modeling document semantic information. There is no doubt that the interactions among documents, whether through implicit semantic relations or explicit links, can provide additional semantic signals to enhance document representations. Reference [32] preliminarily uses document interactions to extract important words and improve model performance. Nevertheless, how to effectively and explicitly incorporate semantic information from other documents into document representations remains an unsolved problem.
(3) Pretraining for Document Representation. Pretraining has shown its effectiveness on a wide range of downstream NLP tasks. Existing pretrained language models such as Word2vec-style word co-occurrence models [38] and BERT-style masked language models [18, 48] focus on representation learning at the word and sentence levels, which does not work well for document-level representation. It is still challenging to model cross-sentence relations, text coherence, and coreference at the document level in document representation learning. Moreover, some methods leverage useful signals such as anchor-document information to supervise document representation learning [67]. How to pretrain document representation models with efficient and effective strategies is still a critical and challenging problem.

References

1. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander
Smola. Scalable inference in latent variable models. In Proceedings of WSDM, 2012.
2. Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and infer-
ence for topic models. In Proceedings of UAI, 2009.
3. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase
from question-answer pairs. In Proceedings of EMNLP, 2013.
4. David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7, 2010.
5. David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of ICML, 2006.
6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003.
7. Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple ques-
tion answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
8. Jordan L Boyd-Graber and David M Blei. Syntactic topic models. In Proceedings of NeurIPS,
2009.
9. Jonathan Chang and David M Blei. Hierarchical relational models for document networks. The
Annals of Applied Statistics, pages 124–150, 2010.
10. Danqi Chen. Neural Reading Comprehension and Beyond. PhD thesis, Stanford University,
2018.
11. Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the
cnn/daily mail reading comprehension task. In Proceedings of ACL, 2016.
12. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer
open-domain questions. In Proceedings of the ACL, 2017.
13. Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. Warplda: a cache efficient o (1)
algorithm for latent dirichlet allocation. Proceedings of VLDB, 2016.
14. Tongfei Chen and Benjamin Van Durme. Discriminative information retrieval for question
answering sentence selection. In Proceedings of EACL, 2017.
References 121

15. Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and
Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of
ACL, 2017.
16. Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-
attention neural networks for reading comprehension. In Proceedings of ACL, 2017.
17. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks
for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018.
18. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
19. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Gated-
attention readers for text comprehension. In Proceedings of ACL, 2017.
20. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated
and extracted knowledge bases. In Proceedings of SIGKDD, 2014.
21. Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic
question-answerer. In Proceedings of IRE-AIEE-ACM, 1961.
22. Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics
and syntax. In Proceedings of NeurIPS, 2004.
23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model
for ad-hoc retrieval. In Proceedings of CIKM, 2016.
24. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of
NeurIPS, 2015.
25. Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network archi-
tectures for matching natural language sentences. In Proceedings of NeurIPS, 2014.
26. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning
deep structured semantic models for web search using clickthrough data. In Proceedings of
CIKM, 2013.
27. Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. Pacrr: A position-aware neural
ir model for relevance matching. In Proceedings of EMNLP, 2017.
28. Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context
language models. arXiv preprint arXiv:1511.03962, 2015.
29. Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. TOIS,
pages 242–262, 2001.
30. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale
reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
31. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In
Proceedings of ICML, 2014.
32. Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, and Jungang
Xu. Nprf: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In
Proceedings of EMNLP, 2018.
33. Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs
and documents. In Proceedings of ACL, 2015.
34. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization
Branches Out, 2004.
35. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-
domain question answering. In Proceedings of ACL, 2018.
36. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking:
Understanding the role of knowledge graph semantics in neural information retrieval. In Pro-
ceedings of ACL, 2018.
37. Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of NeurIPS, 2008.
38. T Mikolov and J Dean. Distributed representations of words and phrases and their composi-
tionality. Proceedings of NeurIPS, 2013.
39. David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with
dirichlet-multinomial regression. In Proceedings of UAI, 2008.
122 5 Document Representation

40. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed
representations of text for web search. In Proceedings of WWW, 2017.
41. David Newman, Arthur U Asuncion, Padhraic Smyth, and Max Welling. Distributed inference
for latent dirichlet allocation. In Proceedings of NeurIPS, 2007.
42. David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic mod-
els. In Proceedings of SIGKDD, 2006.
43. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv
preprint arXiv:1611.09268, 2016.
44. Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying
Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks:
Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech
and Language Processing, 24(4):694–707, 2016.
45. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text match-
ing as image recognition. In Proceedings of AAAI, 2016.
46. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. Deeprank: A
new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM,
2017.
47. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word
representation. In Proceedings of EMNLP, 2014.
48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-
HLT, pages 2227–2237, 2018.
49. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ ques-
tions for machine comprehension of text. In Proceedings of EMNLP, 2016.
50. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset
for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013.
51. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic
model for authors and documents. In Proceedings of UAI, 2004.
52. Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional atten-
tion flow for machine comprehension. In Proceedings of ICLR, 2017.
53. Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional
deep neural networks. In Proceedings of SIGIR, 2015.
54. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic
model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM,
2014.
55. Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop
reading in machine comprehension. In Proceedings of SIGKDD, 2017.
56. Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv
preprint arXiv:1505.00387, 2015.
57. A Cuneyd Tantug, Kemal Oflazer, and Ilknur Durgar El-Kahlout. Bleu+: a tool for fine-grained
bleu computation. 2008.
58. Ellen M Voorhees et al. The trec-8 question answering track report. In Proceedings of TREC,
1999.
59. Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of ICML, 2006.
60. Chong Wang, Bo Thiesson, Chris Meek, and David Blei. Markov topic models. In Proceedings
of AISTATS, 2009.
61. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang,
Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain
question answering. In Proceedings of AAAI, 2018.
62. Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang,
Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-
ranking in open-domain question answering. In Proceedings of ICLR, 2018.
References 123

63. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching
networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
64. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv
preprint arXiv:1410.3916, 2014.
65. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document
ranking. In Proceedings of SIGIR, 2017.
66. Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural
ad-hoc ranking with kernel pooling. In Proceedings of SIGIR, 2017.
67. Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Selective weak supervision
for neural information retrieval. arXiv preprint arXiv:2001.10382, 2020.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License ([Link]), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
