Explainable Word Sense Networks

all three issues together. The contributions of this paper are 4-fold:
• Given a (context, word) pair, this paper can explicitly pin down the dimension in the sparse word representation that represents the sense of the word under the given context.
• This paper is able to interpret the value of a specific dimension in the transformed sparse representation.
• This paper provides a human-understandable textual definition for a particular sense of a word embedding given its context.
• We release a large and high-quality context-definition dataset that consists of abundant example sentences and the corresponding definitions for a variety of words.

Dictionary Corpus
Dictionary corpora are usually available in an online electronic format; however, they often lack example sentences. To the best of our knowledge, the Oxford online dictionary (https://en.oxforddictionaries.com/) is the only one that contains an abundant number of example sentences. Prior work recently released a dataset based on this resource (Gadetsky, Yakubovskiy, and Vetrov 2018). However, their dataset does not contain the complete information available online, which hinders its usage for diverse tasks. Some findings are described here: 1) Their dataset provides only a single example sentence per definition, while there are usually multiple ones online. 2) Some example sentences in their dataset do not contain the target word, making them difficult to use. 3) Some example sentences do not align with their target word and the associated definition. Considering the quality of the released dataset, this paper addresses these problems by releasing a newly collected dataset together with the toolkit for crawling the content. A word example along with its multiple definitions and associated example sentences is shown in Table 1.

Table 1: Part of the content for the word "bass" in the proposed Oxford dataset.

Attribute                   Theirs       Ours
#Words                      36,767       31,798
Avg. #sentences per def.    1            27
POS tag                     N            Y
Total sentences             122,319      1,299,821
Total tokens                3,516,066    18,484,401

Table 2: Dataset comparison between the prior work (Gadetsky, Yakubovskiy, and Vetrov 2018) and the proposed one.

To be more specific, our dataset provides the following guarantees:
• Each example sentence contains the target word it defines.
• We include all example sentences of a specific definition available in the online dictionary.
• We also include the corresponding POS tag of each word sense for further research usage.
The statistics of the proposed dataset are summarized in Table 2, where it is clear that our dataset contains many more example sentences and is about 5 times larger than the one provided by Gadetsky, Yakubovskiy, and Vetrov. This high-quality and rich dataset can be leveraged in different NLP tasks, and this paper utilizes it for learning explainable word sense networks, xSense.
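To make the record structure concrete, a single entry of the collected data can be pictured as the sketch below; the field names and the example values are illustrative assumptions rather than the exact schema of the released files.

```python
# Illustrative sketch of one record in the proposed Oxford dataset.
# Field names and example sentences are hypothetical; the released files may differ.
record = {
    "word": "bass",
    "pos": "NOUN",                                   # POS tag of this word sense
    "definition": "the common European freshwater perch",
    "examples": [                                    # all example sentences of this definition
        "They spent the afternoon fishing for bass in the lake.",   # made-up sentence
        "The bass he caught was surprisingly large.",               # made-up sentence
    ],
}

# Guarantee from the paper: every example sentence contains the target word.
assert all(record["word"] in sentence.lower() for sentence in record["examples"])
```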
xSense: Explainable Word Sense Networks
The proposed model, xSense, consists of four main modules, as illustrated in Figure 1. Given a target word and its context, the model encodes the context (context encoder) and extracts the word's sparse representation (sparse vector extractor). A mask is generated (mask generator) based on the context and the sparse vector in order to find the dimensions that encode the corresponding sense information, and then a definition sentence is generated (definition decoder). Each component is detailed below.

[Figure 1 here: the mask generator aligns the contexts, uses dot products with the top-K rows of Wenc as attention weights, and produces the sense vector m; the definition decoder is initialized with the aligned contexts and the target word embedding vw and generates the definition token by token.]
Figure 1: Illustration of the proposed xSense model. The encoder does not have parameters to train. The sparse extractor is pretrained and fixed during the training of the mask generator and the decoder.

Dual Vocabularies
We propose dual vocabularies, Vw2v and Vdec, used in our model. The first consists of the pretrained embeddings from word2vec (https://code.google.com/archive/p/word2vec/) and is used by the encoder and the sparse vector extractor. The second is randomly initialized and is only used by the decoder. The goal of using two sets of vocabularies is to lower the out-of-vocabulary (OOV) rate. To be more specific, while Vw2v contains a large number of tokens, it misses some common functional words such as 'a' and 'of'. In order to generate such common words in definition sentences, the dedicated vocabulary Vdec is adopted.
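A minimal sketch of how Vdec could be assembled from the definition sentences is shown below; the tokenization and the special tokens are illustrative assumptions rather than the exact preprocessing used in the paper.

```python
import numpy as np

def build_decoder_vocab(definition_sentences, dim=300,
                        specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """V_dec covers every token appearing in definition sentences, including
    functional words such as 'a' and 'of' that the word2vec vocabulary may miss."""
    tokens = {tok for sent in definition_sentences for tok in sent.lower().split()}
    vocab = list(specials) + sorted(tokens)
    # Randomly initialized embeddings; they are trained together with the decoder.
    embeddings = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    return {tok: idx for idx, tok in enumerate(vocab)}, embeddings

word2idx, dec_emb = build_decoder_vocab(["a group of musicians who perform together"])
```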
Context Encoder
Given a context, the encoder module generates a distinguishable and meaningful sentence embedding. Because we do not assume additional resources for training the sentence embedding, the sentence is encoded in an unsupervised manner, which can be achieved with either sophisticated neural-based (Kiros et al. 2015) or weighted-sum-based (Arora, Liang, and Ma 2016) methods.
The latter method is chosen in this paper for two reasons. First, neural-based methods require additional training data and much longer training time. Second, considering that the goal of this paper is interpretability, the weighted-sum method is more transparent for humans to interpret and to investigate errors (we also tried training a bidirectional GRU encoder; the performance is roughly the same).

In our weighted-sum approach, we apply the smooth inverse frequency (SIF) embeddings (Arora, Liang, and Ma 2016), which are inspired by the discourse random walk model (Arora et al. 2016). Formally, given word embeddings vw for w ∈ Vw2v, a sentence s ∈ S, where S is the set of all training sentences, a smoothing parameter a, and the occurrence probabilities p(w) for w ∈ Vw2v derived from the training corpus, SIF computes

vs = (1/|s|) Σ_{w∈s} [a / (a + p(w))] vw,   (1)

where |s| is the length of sentence s; vs will be used in the mask generator to generate the attention mask.
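The SIF encoder in (1) can be sketched in a few lines of Python; the code below assumes word_vecs maps tokens to d-dimensional numpy arrays from Vw2v and word_prob holds the unigram probabilities p(w) estimated on the training corpus.

```python
import numpy as np

def sif_embedding(sentence_tokens, word_vecs, word_prob, a=1e-3, dim=300):
    """Smooth inverse frequency (SIF) sentence embedding of Eq. (1):
    average of word vectors, each reweighted by a / (a + p(w))."""
    vs = np.zeros(dim)
    for w in sentence_tokens:
        if w in word_vecs:                       # skip tokens missing from V_w2v
            vs += (a / (a + word_prob.get(w, 0.0))) * word_vecs[w]
    return vs / max(len(sentence_tokens), 1)     # divide by |s|
```

The default a = 10^-3 follows the smoothing term reported later in the hyperparameter settings.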
Sparse Vector Extractor
Words that have large values in a specific dimension of their sparse representations often form a semantic cluster (Faruqui et al. 2015; Subramanian et al. 2017). This characteristic helps interpret the semantics of different dimensions. Inspired by the idea of sparse coding in Subramanian et al., we incorporate a sparse vector extractor to learn the sparse representation of the target word (Subramanian et al. 2017):

zw = f(Wenc vw + benc),   (2)
v'w = Wdec zw + bdec,   (3)

where f is the capped-ReLU activation function, Wenc ∈ R^{m×d}, benc ∈ R^m, Wdec ∈ R^{d×m}, and bdec ∈ R^d are the learnable parameters, and d is the dimension of the word embedding.

This formulation follows a regular k-sparse autoencoder aiming at minimizing a reconstruction loss and a partial sparsity loss (Makhzani and Frey 2013; Subramanian et al. 2017). Makhzani and Frey pointed out that the k-sparse autoencoder can be viewed as a variant of the iterative thresholding with inversion algorithm (Maleki 2009), which aims to train an overcomplete matrix W that is as orthogonal as possible. After training, W can be used as the dictionary in the sparse recovery stage. In the context of word embeddings, the matrix Wdec contains an orthogonal basis of the embedding space, whose columns are likely to be basic semantic components.

We link this observation to the discourse atom, the basic sense component (Arora et al. 2018). Arora et al. showed that a set of word embeddings can be disentangled into multiple discourse vectors by sparse coding. Formally, given word embeddings vw for w ∈ Vw2v in R^d and an integer m ≫ d, the goal is to minimize

Σ_{w∈Vw2v} ‖ vw − Σ_{j=1}^{m} αw,j Aj ‖²,   (4)
where αw,j represents how much the discourse vector Aj weighs in constituting vw. Both Wdec and the discourse atoms are basic semantic components of the embedding space. Moreover, from the viewpoint of matrix operations, (4) is equivalent to (3) with αw,j = zw,j and Aj = Wdec,j, where Wdec,j is the j-th column of the matrix. In practice, since zw is directly generated by Wenc, we use the corresponding row vectors of Wenc in the mask generator. As illustrated in Figure 1, the sparse vector extractor focuses on decomposing different senses into different dimensions via sparse coding, and the trained sparse encoder is then used by the mask generator.
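A minimal PyTorch sketch of the extractor in (2)-(3) is shown below; the dimensions follow the notation above (d-dimensional word embeddings, m sparse dimensions), while clipping the activations to [0, 1] as the capped ReLU is an assumption consistent with the partial sparsity loss in (23).

```python
import torch
import torch.nn as nn

class SparseVectorExtractor(nn.Module):
    """Sketch of Eqs. (2)-(3): z_w = f(W_enc v_w + b_enc), v'_w = W_dec z_w + b_dec,
    with f a capped ReLU (values clipped to [0, 1])."""

    def __init__(self, d=300, m=1000):   # m is not specified in the paper; 1000 is a guess
        super().__init__()
        self.enc = nn.Linear(d, m)       # W_enc, b_enc
        self.dec = nn.Linear(m, d)       # W_dec, b_dec

    def forward(self, v_w):
        z_w = torch.clamp(self.enc(v_w), min=0.0, max=1.0)   # capped ReLU
        return z_w, self.dec(z_w)                            # sparse code, reconstruction

    def basis_rows(self, indices):
        # Rows of W_enc retrieved by the mask generator in Eq. (6); shape (K, d).
        return self.enc.weight[indices]
```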
Mask Generator
The mask generator module is the key to interpretability: it connects the encoder and the sparse extractor and automatically finds the sense-specific dimensions. Given the SIF embedding vs and a target word embedding vw, we focus on extracting the sense information from vw according to its context. vw is first fed into the sparse vector extractor to produce its sparse representation zw. We then look up the K highest values in the sparse vector and retrieve the corresponding rows of Wenc, which is learned in the sparse vector extractor. Formally, we compute the sparse representation of the target word by (2) and obtain the indices of the K largest values:

γ1···K = argsortK(zw).   (5)

We retrieve the rows of Wenc according to the indices obtained in (5):

sj = Wenc[γj], j ∈ 1···K.   (6)

sj is therefore the γj-th row vector of Wenc. We calculate the inner product between the sentence embedding vs and the basis vectors sj to generate a weighted mask. However, the direct calculation is unreasonable since they do not align well in the vector space. Because both vs and sj are derived from the same pretrained embeddings by almost-linear operations, we assume that learning an additional linear transformation T ∈ R^{d×d} can effectively align the spaces (Conneau et al. 2017). The inner product is thus calculated after the transformation:

dj = (Tvs) · sj, j ∈ 1···K.   (7)

The mask is calculated by a softmax layer:

αj = exp(dj) / Σ_{j'} exp(dj'), j ∈ 1···K.   (8)

Finally, the retrieved basis vectors are weighted by the mask to form the sense vector:

m = Σ_{j=1}^{K} αj sj.   (9)
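Equations (5)-(9) translate almost directly into code; the numpy sketch below assumes W_enc is the (m x d) encoder matrix taken from the pretrained extractor and T is the learned d x d alignment matrix.

```python
import numpy as np

def sense_vector(v_s, z_w, W_enc, T, K=5):
    """Mask generator of Eqs. (5)-(9).
    v_s: SIF sentence embedding (d,); z_w: sparse code of the target word (m,);
    W_enc: encoder matrix (m, d); T: alignment matrix (d, d)."""
    gamma = np.argsort(z_w)[-K:]                  # Eq. (5): indices of the K largest values
    S = W_enc[gamma]                              # Eq. (6): K basis rows, shape (K, d)
    d_scores = S @ (T @ v_s)                      # Eq. (7): inner products after alignment
    alpha = np.exp(d_scores - d_scores.max())     # Eq. (8): softmax over the K scores
    alpha /= alpha.sum()
    m = alpha @ S                                 # Eq. (9): weighted sum of basis vectors
    return m, gamma, alpha
```

The default K = 5 matches the value chosen in the experimental settings.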
Definition Decoder
The decoder module generates a textual definition for a target word given its context. A GRU is applied as our recurrent unit (Cho et al. 2014). We denote a target definition sentence as a sequence of tokens:

ỹ = {ỹ1, ỹ2, ..., ỹM},   (10)

where M is the number of words in the definition. We assign the aligned SIF embedding Tvs to the initial hidden state of the first-layer GRU and the target word embedding vw to the initial state of the second-layer GRU, as illustrated in Figure 1:

h^1_0 = Tvs,   (11)
h^2_0 = vw.   (12)

The goal of using the pretrained target word embedding as the initial hidden state is to provide an explicit signal for the model in order to generate coherent and consistent definitions. We also conduct experiments with signals other than vw in the experiment section to analyze their effectiveness. This initialization conditions the decoder to generate correct definitions. For each decoding step, the input to the cell is the concatenation

xt = [vg, m],   (13)

where vg ∈ Vdec is the ground-truth word embedding at the t-th timestep and m is the sense vector calculated in (9). The decoding process terminates when an end-of-sentence token is predicted. The internal structure of a GRU cell is:

rt = σ(Wr · [ht−1, xt]),   (14)
zt = σ(Wz · [ht−1, xt]),   (15)
h̃t = tanh(Wh̃ · [rt ∗ ht−1, xt]),   (16)
ht = (1 − zt) ∗ ht−1 + zt ∗ h̃t.   (17)

The output is generated by passing the hidden state through a linear layer:

Ot = Wo · ht,   (18)

where Wo ∈ R^{|Vdec|×d}. We use Ot to generate the final distribution over Vdec via a softmax operation. Formally,

pt,i = exp(Ot,i) / Σ_j exp(Ot,j),   (19)
yt = argmax_i pt,i.   (20)

Note that during the testing phase, the decoder is auto-regressive; formally, (13) becomes

xt = [v_{yt}, m].   (21)
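The two-layer GRU initialization of (11)-(13) can be sketched in PyTorch as follows; teacher forcing over a whole batch is shown, greedy or beam decoding at test time is omitted, and the 2d input size simply reflects concatenating a token embedding with the sense vector m.

```python
import torch
import torch.nn as nn

class DefinitionDecoder(nn.Module):
    """Sketch of the decoder in Eqs. (10)-(21): layer-1 hidden state initialized with
    the aligned context Tv_s, layer-2 with the target word embedding v_w; each step
    consumes the concatenation [token embedding, sense vector m]."""

    def __init__(self, vocab_size, d=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)      # V_dec embeddings
        self.gru = nn.GRU(input_size=2 * d, hidden_size=d,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(d, vocab_size)           # W_o in Eq. (18)

    def forward(self, tokens, Tv_s, v_w, m):
        # tokens: (B, M) teacher-forcing inputs; Tv_s, v_w, m: (B, d)
        h0 = torch.stack([Tv_s, v_w], dim=0)          # Eqs. (11)-(12): shape (2, B, d)
        x = torch.cat([self.embed(tokens),            # Eq. (13): [v_g, m] at every step
                       m.unsqueeze(1).expand(-1, tokens.size(1), -1)], dim=-1)
        h, _ = self.gru(x, h0)
        return self.out(h)                            # logits over V_dec, Eqs. (18)-(19)
```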
Optimization
There are two losses for optimizing the sparse extractor. The first is the reconstruction loss:

LR(D) = (1/|D|) Σ_{w∈D} |v'w − vw|²,   (22)

where |D| is the size of the whole dataset, and the second is the partial sparsity loss (Subramanian et al. 2017):

LPS(D) = (1/|D|) Σ_{w∈D} Σ_h vw,h (1 − vw,h).   (23)

This loss encourages every dimension h of vw to be either 0 or 1. Note that the sparse extractor module is pretrained and then fixed. In order to train the whole model in an end-to-end fashion, we minimize the negative log likelihood over the maximum number of decoding steps M:

LNLL = − Σ_{t=1}^{M} log pt(ỹt).   (24)
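As a rough PyTorch illustration of (22)-(24): the reconstruction and partial sparsity terms pretrain the extractor, while the negative log likelihood trains the mask generator and decoder end-to-end. The sparsity penalty is applied here to the capped-ReLU code z, following the SPINE-style formulation the paper builds on; whether the original implementation penalizes exactly this quantity is an assumption.

```python
import torch
import torch.nn.functional as F

def extractor_losses(v_w, v_rec, z_w):
    """Eq. (22): mean squared reconstruction error over the batch.
    Eq. (23): partial sparsity loss pushing each activation toward 0 or 1
    (applied to the capped-ReLU code z_w here; an assumption)."""
    l_rec = ((v_rec - v_w) ** 2).sum(dim=-1).mean()
    l_ps = (z_w * (1.0 - z_w)).sum(dim=-1).mean()
    return l_rec, l_ps

def definition_nll(logits, targets, pad_id=0):
    """Eq. (24): negative log likelihood of the ground-truth definition tokens.
    logits: (B, M, |V_dec|); targets: (B, M); padding positions are ignored."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
```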
Methods                  | 1-Layer Init     | 2-Layer Init     | Each Time Input | Large       | Small       | Unseen
1) Baseline w/o contexts
Noraset et al. (2017)    |        -         |        -         |        -        | 33.8 / 36.3 | 30.5 / 32.7 | 12.0 / 13.3
2) Baseline w/ contexts
Seq2Seq                  |        -         |        -         |        -        | 20.1 / 21.1 | 18.3 / 18.7 | 11.3 / 10.5
Gadetsky et al. (2018)   |        -         |        -         |        -        | 26.0 / 31.6 | 25.5 / 30.4 |  9.8 / 11.3
3) Proposed (xSense)
SSS                      | Sense Vector     | Sense Vector     | Sense Vector    | 14.8 / 17.0 | 14.4 / 15.9 | 12.1 / 13.3
AAS                      | Aligned Contexts | Aligned Contexts | Sense Vector    | 20.6 / 23.0 | 18.6 / 20.3 | 12.4 / 13.9
TTS                      | Target Word      | Target Word      | Sense Vector    | 33.6 / 35.9 | 29.4 / 31.3 | 11.9 / 14.2
ATS                      | Aligned Contexts | Target Word      | Sense Vector    | 37.2 / 39.7 | 30.1 / 32.0 | 12.7 / 14.5
TAS                      | Target Word      | Aligned Contexts | Sense Vector    | 40.0 / 42.6 | 31.9 / 33.9 | 12.4 / 13.2

Table 3: BLEU and ROUGE-L scores (BLEU / ROUGE-L F1) on the Large, Small, and Unseen testsets for the baselines and the proposed xSense variants.
Experiments
To evaluate our proposed model, we conduct various sets of experiments using our newly collected Oxford dataset.

Setting
Hyperparameters: Both Vw2v and Vdec have dimension 300. For the encoder, we fix the smoothing term a in (1) to 10^-3 as recommended (Arora, Liang, and Ma 2016). For the sparse vector extractor, a similar setup is adopted (Subramanian et al. 2017). We choose K = 5 in the mask generator. The definition decoder is a two-layer GRU (Cho et al. 2014) with hidden size 300. The optimizer is SGD with learning rate 0.1 for training the sparse vector extractor and the mask generator, and the Adam optimizer (Kingma and Ba 2014) with the default settings is applied to the decoder.

Testsets: In the experiments, we want to demonstrate the ability of the proposed model at two difficulty levels.
• Easy: The easier level tests (seen words, unseen contexts). Concretely, the small testset is the one proposed by Gadetsky, Yakubovskiy, and Vetrov with 6,809 instances, while the large testset is the one we collect with 42,589 instances.
• Hard: The harder level tests (unseen words, unseen contexts) on the unseen testset with 808 instances, which consists only of target words that are never seen during training.

Evaluation Metrics: Two objective measures are reported: BLEU (Papineni et al. 2002) up to 4-grams and the F measure of ROUGE-L (Lin 2004). Considering that the BLEU score has many smoothing strategies, we decide to follow prior work (Noraset et al. 2017; Gadetsky, Yakubovskiy, and Vetrov 2018) and use the sentence-BLEU binary in the Moses library (http://www.statmt.org/moses/) for a fair comparison. Both scores are averaged across all testing instances.
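For readers who want a quick stand-in for these metrics, the sketch below scores one generated definition with NLTK's sentence-level BLEU and a simple LCS-based ROUGE-L F1. Note that the paper itself uses the Moses sentence-BLEU binary, so exact numbers from this sketch will not match the reported ones.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(ref, hyp):
    """ROUGE-L F1 from the longest common subsequence of two token lists."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ref[i] == hyp[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

reference = "a group of musicians who perform together".split()
hypothesis = "a group of people who play music together".split()
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
print(round(bleu, 3), round(rouge_l_f1(reference, hypothesis), 3))
```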
Baselines: Two sets of baseline approaches are compared, where the first one does not consider the contexts and the second one does. The baseline without contexts is essentially a language model conditioned on the pretrained word embeddings Vw2v, which shares the same architecture as Noraset et al.. We reimplement the model and train it on our proposed dataset for a fair comparison. For baselines with contexts, we train the model proposed by Gadetsky, Yakubovskiy, and Vetrov with their strongest settings on our dataset, as well as a vanilla sequence-to-sequence model with both encoder and decoder being two-layer GRU networks.

Proposed Variants: We try different input variants of (11), (12), and (13) to examine the effectiveness of feeding the explicit signal during decoding. Specifically, for the 1-layer and 2-layer initialization of the GRU and the additional input at each time step, different combinations of the aligned contexts (A), the target word vector (T), and the sense vector (S) are attempted. Note that at least one of the inputs should be the sense vector (m in (9)) in order to optimize the mask generator.

Results
The results are shown in Table 3. Among all baselines, Noraset et al.'s work is the strongest even though their model generates exactly the same definition regardless of context. The probable reason is that dictionary definitions are often written in a highly structured and similar format, so generating the same definition for all contexts can still share some common words with the ground truth.

Among the baselines leveraging contexts, the performance of the sequence-to-sequence model is worse than Gadetsky, Yakubovskiy, and Vetrov's. The probable reason is that Gadetsky, Yakubovskiy, and Vetrov introduced a mask to differentiate contexts and generate definitions accordingly. However, their performance is the worst among all models on the unseen testset, which explicitly evaluates generalizability. This observation suggests that their better performance on large and small is likely due to memorizing information from the training data (overfitting). In addition, their performance gain over Noraset et al.'s work reported in (Gadetsky, Yakubovskiy, and Vetrov 2018) is only 0.46 BLEU (out of 100), which is insignificant.
To analyze the information richness of different variants, we replace the sense vector with the aligned contexts as the initialization of the two hidden layers. Comparing SSS and AAS in Table 3, using the aligned contexts (Tvs) as the initial hidden state of the decoder outperforms using only the sense vector (m). The reason is that the aligned contexts provide the decoder with additional contextual information and help generate more sophisticated definitions, while the sense vector is the weighted sum of basis vectors, as shown in (9), which may introduce some errors due to the imperfection of the sparse vector extractor.

We also try replacing the sense vector with the pretrained target word embedding to initialize the hidden state of the decoder, and significantly better performance is observed (SSS vs. TTS). This is reasonable because pretrained embeddings are trained on a large corpus and thus contain robust and rich information. In addition, they provide a static representation that stabilizes the training process of the decoder. However, we find that while this variant achieves good BLEU/ROUGE scores, the variety of the generated definitions is lower than that of the aligned contexts. In other words, despite pretrained word embeddings being informative, their semantic meaning is likely dominated by the most frequent senses in the training corpus. In fact, we observe that simply using the target word embedding as the initial decoder hidden state cannot distinguish fine-grained senses; the definitions generated by TTS reflect the major senses in most testing instances.

Finally, to balance variety and correctness, combining the aligned contexts with the pretrained word embedding as our decoder initialization (ATS, TAS) is a natural choice based on the experiments. These variants perform best on the Large and Unseen datasets, demonstrating better performance and generalizability.

The performance of all models is poorer on Unseen than on the other testsets. That is because these words are not encountered during training, making the embedding explanation much more difficult. Moreover, we manually check the test words and find that most of them are uncommon words, making this testset even harder.

Human Evaluation
In order to assess the quality of the generated definitions, we randomly select two hundred samples from the Small dataset for human evaluation, where two settings are reported: one includes all words (All) and the other includes only the words for which multiple (≥3) senses are sampled (Multi-Sense). There are four candidate models: all baselines, one of our best models (xSense-ATS), and xSense without alignment in (7) that jointly learns the sparse vector extractor. Three human annotators are recruited to rank the generated definitions given the target word and its corresponding contexts in each sample. Table 4 shows the final statistics, where the top-1 choices and the accumulated scores are reported (4: first, 3: second, 2: third, 1: last). Note that in some samples two models may generate exactly the same definition; if an annotator picks either of them, we assign the same score to the other.

Model                                                    | Top 1 (All) | Top 1 (Multi-Sense) | Score (All)  | Score (Multi-Sense)
Noraset et al. (2017)                                    | 311 (30.8%) | 17 (28.4%)          | 1887 (27.6%) | 111 (27.2%)
Gadetsky et al. (2018)                                   | 240 (23.8%) |  9 (15.0%)          | 1701 (24.9%) |  92 (22.5%)
xSense w/o Alignment                                     | 115 (11.4%) |  8 (13.3%)          | 1182 (17.3%) |  80 (19.9%)
xSense-ATS (Aligned Contexts/Target Word/Sense Vector)   | 342 (34.0%) | 26 (43.3%)          | 2055 (30.2%) | 124 (30.4%)

Table 4: Ranked human evaluation results on 200 randomly sampled questions from the Small dataset.

It can be found that our model performs best among all candidates in both settings, i.e., for all target words and for multi-sense target words. While Noraset et al.'s work achieves the second-best performance, it cannot distinguish different senses since it does not consider the contexts, which defeats the purpose of explaining embeddings. The multi-sense setting indeed shows that our proposed model significantly outperforms theirs. The worst model is the one without alignment, indicating that the basis vectors and the sentence embedding do not align in the vector space, so the attention cannot be correctly obtained.

Qualitative Analysis
An important capability of our model is that we can pin down the dimension in the sparse representation of a target word given its context. This is difficult to convey with numbers alone, so we show some samples for analysis in Table 5. We can see that the nearest neighbors and the generated definitions belong to the same semantic clusters. Moreover, we are able to disentangle multiple senses based on the given contexts.

To better understand the limitations of our model, we show some common mistakes in Table 6. For the word bass, our model generates the wrong definition while picking up the correct nearest neighbors. Note that the generated wrong definition is another sense of bass, so the cause of this error may be the imbalance of sense frequencies in the training data, considering that bass as a kind of fish is a relatively rare sense. For the word tie, the generated definition is correct while the selected nearest neighbors are wrong. Because the nearest neighbors are determined by (8), this error type may be propagated from the SIF sentence embedding.
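One plausible way to obtain the per-dimension nearest neighbors reported in Table 5 is to encode the whole vocabulary with the pretrained sparse extractor and list the words with the highest activation in the selected dimension; this reconstruction of the analysis procedure is an assumption, not a description of the authors' exact script.

```python
import numpy as np

def dimension_neighbors(dim, vocab, sparse_codes, top_n=3):
    """Words with the largest value in one sparse dimension.
    vocab:        list of words in V_w2v
    sparse_codes: (|V_w2v|, m) matrix of z_w vectors from Eq. (2)
    dim:          dimension index selected by the mask generator in Eq. (5)"""
    order = np.argsort(sparse_codes[:, dim])[::-1][:top_n]
    return [vocab[i] for i in order]

# For instance, dimension_neighbors(215, vocab, Z) could surface a music-related
# cluster such as the 'punk, tracklist, hiphop' neighbors shown in Table 5.
```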
Target Word: band
  Context: He looked around and saw what he was looking for, a band of thin electrical wire.
  Gen. Definition: A circular revolving plate supporting a single wire or other object of rock
  Nearest Neighbors: inductor, chipset, transceiver (701-th dimension)

  Context: In her spare time she performs as one of three vocalists in a band.
  Gen. Definition: A group of musicians actors or dancers who perform together
  Nearest Neighbors: punk, tracklist, hiphop (215-th dimension)

Target Word: cool
  Context: I closed my eyes again and imagined myself in a cool refreshing blue pool.
  Gen. Definition: soothing or refreshing because of its low temperature
  Nearest Neighbors: humid, moist, wintry (213-th dimension)

  Context: There is need to cool off our tempers and stop fanning the embers of dissent.
  Gen. Definition: unemotional undemonstrative or impassive dancers who perform together
  Nearest Neighbors: levelheaded, gentlemanly, personable (161-th dimension)

Target Word: bow
  Context: It was customary when they finished to bow as a sign of respect to their master.
  Gen. Definition: a gesture of acknowledgement or concession to
  Nearest Neighbors: palanquin, casket, limousine (143-th dimension)

  Context: Pat was wearing a black spandex long sleeved shirt with a thin thread tied in a bow.
  Gen. Definition: a length of cord rope wire or other material serving a particular purpose
  Nearest Neighbors: embroidery, ribbon, fabric (782-th dimension)

Table 5: Analysis of the generated definitions and the nearest neighbors of the single dimension with the highest value in the sparse vector.

Related Work
This work can be viewed as a bridge that connects sparse embeddings and sense embeddings for better interpretability via definition modeling.
Sparse embedding: Several works have shown that introducing sparsity in word embedding dimensions improves dimension interpretability (Murphy, Talukdar, and Mitchell 2012; Fyshe et al. 2015) and the usefulness of word embeddings as features in downstream tasks (Guo et al. 2014). These works focused on investigating the internal characteristics of word embeddings, making it hard to support real-world applications such as word sense disambiguation (WSD). In addition, they cannot provide explicit textual definitions of word embeddings.

Sense-level embedding: In the literature, most prior works assigned a vector representation to each sense of a word. They often assumed a large training corpus to facilitate the training of multi-sense embeddings in an unsupervised manner (Reisinger and Mooney 2010; Li and Jurafsky 2015; Lee and Chen 2017). Note that the sense embeddings in our framework are disentangled internally by a sparse autoencoder, so additional training data is not required. Also, unlike the prior work, our model can provide human-readable definitions for better interpretability.

Dictionary definition task: Several works have utilized dictionary definitions to perform ranking tasks or to learn word embeddings. In the ranking tasks, the models are evaluated by how well they rank words for given definitions (Hill et al. 2015) or definitions for words (Noraset et al. 2017). Aside from ranking tasks, Bahdanau et al. suggested using definitions to compute embeddings for out-of-vocabulary words. Different from these works, this paper focuses on utilizing the textual definitions to provide the capability of explaining the embeddings via human-understandable natural language.

Conclusion
In this paper, the interpretability of word embedding dimensions is investigated. Our proposed model is able to pin down a specific dimension of a word's sparse representation via an attention mechanism in an unsupervised manner and to generate the corresponding textual definition at the same time. In the experiments, the proposed model outperforms others in both the quantitative results and the human evaluation. Finally, we release a new high-quality dataset that is five times larger than the currently available one, providing potential directions for future research.
References
Arora, S.; Li, Y.; Liang, Y.; Ma, T.; and Risteski, A. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4:385–399.
Arora, S.; Li, Y.; Liang, Y.; Ma, T.; and Risteski, A. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics 6:483–495.
Arora, S.; Liang, Y.; and Ma, T. 2016. A simple but tough-to-beat baseline for sentence embeddings.
Bahdanau, D.; Bosc, T.; Jastrzbski, S.; Grefenstette, E.; Vincent, P.; and Bengio, Y. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
Faruqui, M.; Tsvetkov, Y.; Yogatama, D.; Dyer, C.; and Smith, N. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004.
Fyshe, A.; Wehbe, L.; Talukdar, P. P.; Murphy, B.; and Mitchell, T. M. 2015. A compositional and interpretable semantic space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 32–41.
Gadetsky, A.; Yakubovskiy, I.; and Vetrov, D. 2018. Conditional generators of words definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 266–271.
Guo, J.; Che, W.; Wang, H.; and Liu, T. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 110–120.
Hill, F.; Cho, K.; Korhonen, A.; and Bengio, Y. 2015. Learning to understand phrases by embedding the dictionary. arXiv preprint arXiv:1504.00548.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3294–3302.
Lee, G.-H., and Chen, Y.-N. 2017. MUSE: Modularizing unsupervised sense embeddings. arXiv preprint arXiv:1704.04601.
Li, J., and Jurafsky, D. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Lipton, Z. C. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
Makhzani, A., and Frey, B. 2013. k-sparse autoencoders. arXiv preprint arXiv:1312.5663.
Maleki, A. 2009. Coherence analysis of iterative thresholding algorithms. In 47th Annual Allerton Conference on Communication, Control, and Computing, 236–243. IEEE.
Murphy, B.; Talukdar, P.; and Mitchell, T. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. 1933–1950.
Noraset, T.; Liang, C.; Birnbaum, L.; and Downey, D. 2017. Definition modeling: Learning to define word embeddings in natural language. In Proceedings of AAAI.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
Reisinger, J., and Mooney, R. J. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 109–117.
Subramanian, A.; Pruthi, D.; Jhamtani, H.; Berg-Kirkpatrick, T.; and Hovy, E. 2017. SPINE: Sparse interpretable neural embeddings. arXiv preprint arXiv:1711.08792.