Explainable Word Sense Networks

all three issues together. The contributions of this paper are 4-fold:
• Given a (context, word) pair, this paper can explicitly pin down the dimension in the sparse word representation that represents the sense of the word under the given context.
• This paper is able to interpret the value of a specific dimension in the transformed sparse representation.
• This paper provides a human-understandable textual definition for a particular sense of a word embedding given its context.
• We release a large and high-quality context-definition dataset that consists of abundant example sentences and the corresponding definitions for a variety of words.

Dictionary Corpus
Dictionary corpora are usually available in an online electronic format; however, they often lack example sentences. To the best of our knowledge, the Oxford online dictionary (https://en.oxforddictionaries.com/) is the only one that contains an abundant number of example sentences. Prior work recently released a dataset based on this resource (Gadetsky, Yakubovskiy, and Vetrov 2018). However, their dataset does not contain the complete information available online, which hinders its usage for diverse tasks. Some findings are described here: 1) Their dataset provides only a single example sentence per definition, while there are usually multiple ones online. 2) Some example sentences in their dataset do not contain the target word, making them difficult to use. 3) Some example sentences do not align with their target word and the associated definition. Considering the quality of the released dataset, this paper addresses these problems by releasing a newly collected dataset together with the toolkit for crawling the content. A word example along with its multiple definitions and associated example sentences is shown in Table 1.

Table 1: Part of the content for the word "bass" in the proposed Oxford dataset.

Attribute                   Theirs       Ours
#Words                      36,767       31,798
Avg. #sentences per def.    1            27
POS tag                     N            Y
Total sentences             122,319      1,299,821
Total tokens                3,516,066    18,484,401

Table 2: Dataset comparison between the prior work (Gadetsky, Yakubovskiy, and Vetrov 2018) and the proposed one.

To be more specific, our dataset provides the following guarantees:
• Each example sentence contains the target word it defines.
• We include all example sentences of a specific definition available in the online dictionary.
• We also include the corresponding POS tag of each word sense for further research usage.
The statistics of the proposed dataset are summarized in Table 2, where it is clear that our dataset contains many more example sentences and is about 5 times larger than the one provided by Gadetsky, Yakubovskiy, and Vetrov. This high-quality and rich dataset can be leveraged in different NLP tasks, and this paper utilizes it for learning explainable word sense networks, xSense.
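To make the record structure concrete, a single entry of the collected data can be pictured as the sketch below; the field names and the example values are illustrative assumptions rather than the exact schema of the released files.

```python
# Illustrative sketch of one record in the proposed Oxford dataset.
# Field names and example sentences are hypothetical; the released files may differ.
record = {
    "word": "bass",
    "pos": "NOUN",                                   # POS tag of this word sense
    "definition": "the common European freshwater perch",
    "examples": [                                    # all example sentences of this definition
        "They spent the afternoon fishing for bass in the lake.",   # made-up sentence
        "The bass he caught was surprisingly large.",               # made-up sentence
    ],
}

# Guarantee from the paper: every example sentence contains the target word.
assert all(record["word"] in sentence.lower() for sentence in record["examples"])
```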
xSense: Explainable Word Sense Networks
The proposed model, xSense, consists of four main modules, as illustrated in Figure 1. Given a target word and its context, the model encodes the context (context encoder) and extracts the word's sparse representation (sparse vector extractor). A mask is generated (mask generator) based on the context and the sparse vector in order to find the dimensions that encode the corresponding sense information, and then a definition sentence is generated (definition decoder). Each component is detailed below.

[Figure 1 here: the mask generator aligns the contexts, uses dot products with the top-K rows of Wenc as attention weights, and produces the sense vector m; the definition decoder is initialized with the aligned contexts and the target word embedding vw and generates the definition token by token.]
Figure 1: Illustration of the proposed xSense model. The encoder does not have parameters to train. The sparse extractor is pretrained and fixed during the training of the mask generator and the decoder.

Dual Vocabularies
We propose dual vocabularies, Vw2v and Vdec, used in our model. The first consists of the pretrained embeddings from word2vec (https://code.google.com/archive/p/word2vec/) and is used by the encoder and the sparse vector extractor. The second is randomly initialized and is only used by the decoder. The goal of using two sets of vocabularies is to lower the out-of-vocabulary (OOV) rate. To be more specific, while Vw2v contains a large number of tokens, it misses some common functional words such as 'a' and 'of'. In order to generate such common words in definition sentences, the dedicated vocabulary Vdec is adopted.
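A minimal sketch of how Vdec could be assembled from the definition sentences is shown below; the tokenization and the special tokens are illustrative assumptions rather than the exact preprocessing used in the paper.

```python
import numpy as np

def build_decoder_vocab(definition_sentences, dim=300,
                        specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """V_dec covers every token appearing in definition sentences, including
    functional words such as 'a' and 'of' that the word2vec vocabulary may miss."""
    tokens = {tok for sent in definition_sentences for tok in sent.lower().split()}
    vocab = list(specials) + sorted(tokens)
    # Randomly initialized embeddings; they are trained together with the decoder.
    embeddings = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    return {tok: idx for idx, tok in enumerate(vocab)}, embeddings

word2idx, dec_emb = build_decoder_vocab(["a group of musicians who perform together"])
```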
Context Encoder
Given a context, the encoder module generates a distinguishable and meaningful sentence embedding. Because we do not assume additional resources for training the sentence embedding, the sentence is encoded in an unsupervised manner, which can be achieved with either sophisticated neural-based (Kiros et al. 2015) or weighted-sum-based (Arora, Liang, and Ma 2016) methods.
The latter method is chosen in this paper for two reasons. First, neural-based methods require additional training data and much longer training time. Second, considering that the goal of this paper is interpretability, the weighted-sum method is more transparent for humans to interpret and to investigate errors (we also tried training a bidirectional GRU encoder; the performance is roughly the same).

In our weighted-sum approach, we apply the smooth inverse frequency (SIF) embeddings (Arora, Liang, and Ma 2016), which are inspired by the discourse random walk model (Arora et al. 2016). Formally, given word embeddings vw for w ∈ Vw2v, a sentence s ∈ S, where S is the set of all training sentences, a smoothing parameter a, and the occurrence probabilities p(w) for w ∈ Vw2v derived from the training corpus, SIF computes

vs = (1/|s|) Σ_{w∈s} [a / (a + p(w))] vw,   (1)

where |s| is the length of sentence s; vs will be used in the mask generator to generate the attention mask.
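The SIF encoder in (1) can be sketched in a few lines of Python; the code below assumes word_vecs maps tokens to d-dimensional numpy arrays from Vw2v and word_prob holds the unigram probabilities p(w) estimated on the training corpus.

```python
import numpy as np

def sif_embedding(sentence_tokens, word_vecs, word_prob, a=1e-3, dim=300):
    """Smooth inverse frequency (SIF) sentence embedding of Eq. (1):
    average of word vectors, each reweighted by a / (a + p(w))."""
    vs = np.zeros(dim)
    for w in sentence_tokens:
        if w in word_vecs:                       # skip tokens missing from V_w2v
            vs += (a / (a + word_prob.get(w, 0.0))) * word_vecs[w]
    return vs / max(len(sentence_tokens), 1)     # divide by |s|
```

The default a = 10^-3 follows the smoothing term reported later in the hyperparameter settings.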
Sparse Vector Extractor
Words that have large values in a specific dimension of their sparse representations often form a semantic cluster (Faruqui et al. 2015; Subramanian et al. 2017). This characteristic helps interpret the semantics of different dimensions. Inspired by the idea of sparse coding in Subramanian et al., we incorporate a sparse vector extractor to learn the sparse representation of the target word (Subramanian et al. 2017):

zw = f(Wenc vw + benc),   (2)
v'w = Wdec zw + bdec,   (3)

where f is the capped-ReLU activation function, Wenc ∈ R^{m×d}, benc ∈ R^m, Wdec ∈ R^{d×m}, and bdec ∈ R^d are the learnable parameters, and d is the dimension of the word embedding.

This formulation follows a regular k-sparse autoencoder aiming at minimizing a reconstruction loss and a partial sparsity loss (Makhzani and Frey 2013; Subramanian et al. 2017). Makhzani and Frey pointed out that the k-sparse autoencoder can be viewed as a variant of the iterative thresholding with inversion algorithm (Maleki 2009), which aims to train an overcomplete matrix W that is as orthogonal as possible. After training, W can be used as the dictionary in the sparse recovery stage. In the context of word embeddings, the matrix Wdec contains an orthogonal basis of the embedding space, whose columns are likely to be basic semantic components.

We link this observation to the discourse atom, the basic sense component (Arora et al. 2018). Arora et al. showed that a set of word embeddings can be disentangled into multiple discourse vectors by sparse coding. Formally, given word embeddings vw for w ∈ Vw2v in R^d and an integer m ≫ d, the goal is to minimize

Σ_{w∈Vw2v} ‖ vw − Σ_{j=1}^{m} αw,j Aj ‖²,   (4)
where αw,j represents how much the discourse vector Aj weighs in constituting vw. Both Wdec and the discourse atoms are basic semantic components of the embedding space. Moreover, from the viewpoint of matrix operations, (4) is equivalent to (3) with αw,j = zw,j and Aj = Wdec,j, where Wdec,j is the j-th column of the matrix. In practice, since zw is directly generated by Wenc, we use the corresponding row vectors of Wenc in the mask generator. As illustrated in Figure 1, the sparse vector extractor focuses on decomposing different senses into different dimensions via sparse coding, and the trained sparse encoder is then used by the mask generator.
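A minimal PyTorch sketch of the extractor in (2)-(3) is shown below; the dimensions follow the notation above (d-dimensional word embeddings, m sparse dimensions), while clipping the activations to [0, 1] as the capped ReLU is an assumption consistent with the partial sparsity loss in (23).

```python
import torch
import torch.nn as nn

class SparseVectorExtractor(nn.Module):
    """Sketch of Eqs. (2)-(3): z_w = f(W_enc v_w + b_enc), v'_w = W_dec z_w + b_dec,
    with f a capped ReLU (values clipped to [0, 1])."""

    def __init__(self, d=300, m=1000):   # m is not specified in the paper; 1000 is a guess
        super().__init__()
        self.enc = nn.Linear(d, m)       # W_enc, b_enc
        self.dec = nn.Linear(m, d)       # W_dec, b_dec

    def forward(self, v_w):
        z_w = torch.clamp(self.enc(v_w), min=0.0, max=1.0)   # capped ReLU
        return z_w, self.dec(z_w)                            # sparse code, reconstruction

    def basis_rows(self, indices):
        # Rows of W_enc retrieved by the mask generator in Eq. (6); shape (K, d).
        return self.enc.weight[indices]
```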
Mask Generator
The mask generator module is the key to interpretability: it connects the encoder and the sparse extractor and automatically finds the sense-specific dimensions. Given the SIF embedding vs and a target word embedding vw, we focus on extracting the sense information from vw according to its context. vw is first fed into the sparse vector extractor to produce its sparse representation zw. We then look up the K highest values in the sparse vector and retrieve the corresponding rows of Wenc, which is learned in the sparse vector extractor. Formally, we compute the sparse representation of the target word by (2) and obtain the indices of the K largest values:

γ1···K = argsortK(zw).   (5)

We retrieve the rows of Wenc according to the indices obtained in (5):

sj = Wenc[γj], j ∈ 1···K.   (6)

sj is therefore the γj-th row vector of Wenc. We calculate the inner product between the sentence embedding vs and the basis vectors sj to generate a weighted mask. However, the direct calculation is unreasonable since they do not align well in the vector space. Because both vs and sj are derived from the same pretrained embeddings by almost-linear operations, we assume that learning an additional linear transformation T ∈ R^{d×d} can effectively align the spaces (Conneau et al. 2017). The inner product is thus calculated after the transformation:

dj = (Tvs) · sj, j ∈ 1···K.   (7)

The mask is calculated by a softmax layer:

αj = exp(dj) / Σ_{j'} exp(dj'), j ∈ 1···K.   (8)

Finally, the retrieved basis vectors are weighted by the mask to form the sense vector:

m = Σ_{j=1}^{K} αj sj.   (9)
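Equations (5)-(9) translate almost directly into code; the numpy sketch below assumes W_enc is the (m x d) encoder matrix taken from the pretrained extractor and T is the learned d x d alignment matrix.

```python
import numpy as np

def sense_vector(v_s, z_w, W_enc, T, K=5):
    """Mask generator of Eqs. (5)-(9).
    v_s: SIF sentence embedding (d,); z_w: sparse code of the target word (m,);
    W_enc: encoder matrix (m, d); T: alignment matrix (d, d)."""
    gamma = np.argsort(z_w)[-K:]                  # Eq. (5): indices of the K largest values
    S = W_enc[gamma]                              # Eq. (6): K basis rows, shape (K, d)
    d_scores = S @ (T @ v_s)                      # Eq. (7): inner products after alignment
    alpha = np.exp(d_scores - d_scores.max())     # Eq. (8): softmax over the K scores
    alpha /= alpha.sum()
    m = alpha @ S                                 # Eq. (9): weighted sum of basis vectors
    return m, gamma, alpha
```

The default K = 5 matches the value chosen in the experimental settings.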
Definition Decoder
The decoder module generates a textual definition for a target word given its context. A GRU is applied as our recurrent unit (Cho et al. 2014). We denote a target definition sentence as a sequence of tokens:

ỹ = {ỹ1, ỹ2, ..., ỹM},   (10)

where M is the number of words in the definition. We assign the aligned SIF embedding Tvs to the initial hidden state of the first-layer GRU and the target word embedding vw to the initial state of the second-layer GRU, as illustrated in Figure 1:

h^1_0 = Tvs,   (11)
h^2_0 = vw.   (12)

The goal of using the pretrained target word embedding as the initial hidden state is to provide an explicit signal for the model in order to generate coherent and consistent definitions. We also conduct experiments with signals other than vw in the experiment section to analyze their effectiveness. This initialization conditions the decoder to generate correct definitions. For each decoding step, the input to the cell is the concatenation

xt = [vg, m],   (13)

where vg ∈ Vdec is the ground-truth word embedding at the t-th timestep and m is the sense vector calculated in (9). The decoding process terminates when an end-of-sentence token is predicted. The internal structure of a GRU cell is:

rt = σ(Wr · [ht−1, xt]),   (14)
zt = σ(Wz · [ht−1, xt]),   (15)
h̃t = tanh(Wh̃ · [rt ∗ ht−1, xt]),   (16)
ht = (1 − zt) ∗ ht−1 + zt ∗ h̃t.   (17)

The output is generated by passing the hidden state through a linear layer:

Ot = Wo · ht,   (18)

where Wo ∈ R^{|Vdec|×d}. We use Ot to generate the final distribution over Vdec via a softmax operation. Formally,

pt,i = exp(Ot,i) / Σ_j exp(Ot,j),   (19)
yt = argmax_i pt,i.   (20)

Note that during the testing phase, the decoder is auto-regressive; formally, (13) becomes

xt = [v_{yt}, m].   (21)
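The two-layer GRU initialization of (11)-(13) can be sketched in PyTorch as follows; teacher forcing over a whole batch is shown, greedy or beam decoding at test time is omitted, and the 2d input size simply reflects concatenating a token embedding with the sense vector m.

```python
import torch
import torch.nn as nn

class DefinitionDecoder(nn.Module):
    """Sketch of the decoder in Eqs. (10)-(21): layer-1 hidden state initialized with
    the aligned context Tv_s, layer-2 with the target word embedding v_w; each step
    consumes the concatenation [token embedding, sense vector m]."""

    def __init__(self, vocab_size, d=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)      # V_dec embeddings
        self.gru = nn.GRU(input_size=2 * d, hidden_size=d,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(d, vocab_size)           # W_o in Eq. (18)

    def forward(self, tokens, Tv_s, v_w, m):
        # tokens: (B, M) teacher-forcing inputs; Tv_s, v_w, m: (B, d)
        h0 = torch.stack([Tv_s, v_w], dim=0)          # Eqs. (11)-(12): shape (2, B, d)
        x = torch.cat([self.embed(tokens),            # Eq. (13): [v_g, m] at every step
                       m.unsqueeze(1).expand(-1, tokens.size(1), -1)], dim=-1)
        h, _ = self.gru(x, h0)
        return self.out(h)                            # logits over V_dec, Eqs. (18)-(19)
```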
Optimization
There are two losses for optimizing the sparse extractor. The first is the reconstruction loss:

LR(D) = (1/|D|) Σ_{w∈D} |v'w − vw|²,   (22)

where |D| is the size of the whole dataset, and the second is the partial sparsity loss (Subramanian et al. 2017):

LPS(D) = (1/|D|) Σ_{w∈D} Σ_h vw,h (1 − vw,h).   (23)

This loss encourages every dimension h of vw to be either 0 or 1. Note that the sparse extractor module is pretrained and then fixed. In order to train the whole model in an end-to-end fashion, we minimize the negative log likelihood over the maximum number of decoding steps M:

LNLL = − Σ_{t=1}^{M} log pt(ỹt).   (24)
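As a rough PyTorch illustration of (22)-(24): the reconstruction and partial sparsity terms pretrain the extractor, while the negative log likelihood trains the mask generator and decoder end-to-end. The sparsity penalty is applied here to the capped-ReLU code z, following the SPINE-style formulation the paper builds on; whether the original implementation penalizes exactly this quantity is an assumption.

```python
import torch
import torch.nn.functional as F

def extractor_losses(v_w, v_rec, z_w):
    """Eq. (22): mean squared reconstruction error over the batch.
    Eq. (23): partial sparsity loss pushing each activation toward 0 or 1
    (applied to the capped-ReLU code z_w here; an assumption)."""
    l_rec = ((v_rec - v_w) ** 2).sum(dim=-1).mean()
    l_ps = (z_w * (1.0 - z_w)).sum(dim=-1).mean()
    return l_rec, l_ps

def definition_nll(logits, targets, pad_id=0):
    """Eq. (24): negative log likelihood of the ground-truth definition tokens.
    logits: (B, M, |V_dec|); targets: (B, M); padding positions are ignored."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
```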
Methods                  | 1-Layer Init     | 2-Layer Init     | Each Time Input | Large       | Small       | Unseen
1) Baseline w/o contexts
Noraset et al. (2017)    |        -         |        -         |        -        | 33.8 / 36.3 | 30.5 / 32.7 | 12.0 / 13.3
2) Baseline w/ contexts
Seq2Seq                  |        -         |        -         |        -        | 20.1 / 21.1 | 18.3 / 18.7 | 11.3 / 10.5
Gadetsky et al. (2018)   |        -         |        -         |        -        | 26.0 / 31.6 | 25.5 / 30.4 |  9.8 / 11.3
3) Proposed (xSense)
SSS                      | Sense Vector     | Sense Vector     | Sense Vector    | 14.8 / 17.0 | 14.4 / 15.9 | 12.1 / 13.3
AAS                      | Aligned Contexts | Aligned Contexts | Sense Vector    | 20.6 / 23.0 | 18.6 / 20.3 | 12.4 / 13.9
TTS                      | Target Word      | Target Word      | Sense Vector    | 33.6 / 35.9 | 29.4 / 31.3 | 11.9 / 14.2
ATS                      | Aligned Contexts | Target Word      | Sense Vector    | 37.2 / 39.7 | 30.1 / 32.0 | 12.7 / 14.5
TAS                      | Target Word      | Aligned Contexts | Sense Vector    | 40.0 / 42.6 | 31.9 / 33.9 | 12.4 / 13.2

Table 3: BLEU and ROUGE-L scores (BLEU / ROUGE-L F1) on the Large, Small, and Unseen testsets for the baselines and the proposed xSense variants.
Experiments
To evaluate our proposed model, we conduct various sets of experiments using our newly collected Oxford dataset.

Setting
Hyperparameters: Both Vw2v and Vdec have dimension 300. For the encoder, we fix the smoothing term a in (1) to 10^-3 as recommended (Arora, Liang, and Ma 2016). For the sparse vector extractor, a similar setup is adopted (Subramanian et al. 2017). We choose K = 5 in the mask generator. The definition decoder is a two-layer GRU (Cho et al. 2014) with hidden size 300. The optimizer is SGD with learning rate 0.1 for training the sparse vector extractor and the mask generator, and the Adam optimizer (Kingma and Ba 2014) with the default settings is applied to the decoder.

Testsets: In the experiments, we want to demonstrate the ability of the proposed model at two difficulty levels.
• Easy: The easier level tests (seen words, unseen contexts). Concretely, the small testset is the one proposed by Gadetsky, Yakubovskiy, and Vetrov with 6,809 instances, while the large testset is the one we collect with 42,589 instances.
• Hard: The harder level tests (unseen words, unseen contexts) on the unseen testset with 808 instances, which consists only of target words that are never seen during training.

Evaluation Metrics: Two objective measures are reported: BLEU (Papineni et al. 2002) up to 4-grams and the F measure of ROUGE-L (Lin 2004). Considering that the BLEU score has many smoothing strategies, we decide to follow prior work (Noraset et al. 2017; Gadetsky, Yakubovskiy, and Vetrov 2018) and use the sentence-BLEU binary in the Moses library (http://www.statmt.org/moses/) for a fair comparison. Both scores are averaged across all testing instances.
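For readers who want a quick stand-in for these metrics, the sketch below scores one generated definition with NLTK's sentence-level BLEU and a simple LCS-based ROUGE-L F1. Note that the paper itself uses the Moses sentence-BLEU binary, so exact numbers from this sketch will not match the reported ones.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(ref, hyp):
    """ROUGE-L F1 from the longest common subsequence of two token lists."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ref[i] == hyp[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

reference = "a group of musicians who perform together".split()
hypothesis = "a group of people who play music together".split()
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
print(round(bleu, 3), round(rouge_l_f1(reference, hypothesis), 3))
```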
Baselines: Two sets of baseline approaches are compared, where the first one does not consider the contexts and the second one does. The baseline without contexts is essentially a language model conditioned on the pretrained word embeddings Vw2v, which shares the same architecture as Noraset et al.. We reimplement the model and train it on our proposed dataset for a fair comparison. For baselines with contexts, we train the model proposed by Gadetsky, Yakubovskiy, and Vetrov with their strongest settings on our dataset, as well as a vanilla sequence-to-sequence model with both encoder and decoder being two-layer GRU networks.

Proposed Variants: We try different input variants of (11), (12), and (13) to examine the effectiveness of feeding the explicit signal during decoding. Specifically, for the 1-layer and 2-layer initialization of the GRU and the additional input at each time step, different combinations of the aligned contexts (A), the target word vector (T), and the sense vector (S) are attempted. Note that at least one of the inputs should be the sense vector (m in (9)) in order to optimize the mask generator.

Results
The results are shown in Table 3. Among all baselines, Noraset et al.'s work is the strongest even though their model generates exactly the same definition regardless of context. The probable reason is that dictionary definitions are often written in a highly structured and similar format, so generating the same definition for all contexts can still share some common words with the ground truth.

Among the baselines leveraging contexts, the performance of the sequence-to-sequence model is worse than Gadetsky, Yakubovskiy, and Vetrov's. The probable reason is that Gadetsky, Yakubovskiy, and Vetrov introduced a mask to differentiate contexts and generate definitions accordingly. However, their performance is the worst among all models on the unseen testset, which explicitly evaluates generalizability. This observation suggests that their better performance on large and small is likely due to memorizing information from the training data (overfitting). In addition, their performance gain over Noraset et al.'s work reported in (Gadetsky, Yakubovskiy, and Vetrov 2018) is only 0.46 BLEU (out of 100), which is insignificant.
To analyze the information richness of different variants, we replace the sense vector with the aligned contexts as the initialization of the two hidden layers. Comparing SSS and AAS in Table 3, using the aligned contexts (Tvs) as the initial hidden state of the decoder outperforms using only the sense vector (m). The reason is that the aligned contexts provide the decoder with additional contextual information and help generate more sophisticated definitions, while the sense vector is the weighted sum of basis vectors, as shown in (9), which may introduce some errors due to the imperfection of the sparse vector extractor.

We also try replacing the sense vector with the pretrained target word embedding to initialize the hidden state of the decoder, and significantly better performance is observed (SSS vs. TTS). This is reasonable because pretrained embeddings are trained on a large corpus and thus contain robust and rich information. In addition, they provide a static representation that stabilizes the training process of the decoder. However, we find that while this variant achieves good BLEU/ROUGE scores, the variety of the generated definitions is lower than that of the aligned contexts. In other words, despite pretrained word embeddings being informative, their semantic meaning is likely dominated by the most frequent senses in the training corpus. In fact, we observe that simply using the target word embedding as the initial decoder hidden state cannot distinguish fine-grained senses; the definitions generated by TTS reflect the major senses in most testing instances.

Finally, to balance variety and correctness, combining the aligned contexts with the pretrained word embedding as our decoder initialization (ATS, TAS) is a natural choice based on the experiments. These variants perform best on the Large and Unseen datasets, demonstrating better performance and generalizability.

The performance of all models is poorer on Unseen than on the other testsets. That is because these words are not encountered during training, making the embedding explanation much more difficult. Moreover, we manually check the test words and find that most of them are uncommon words, making this testset even harder.

Human Evaluation
In order to assess the quality of the generated definitions, we randomly select two hundred samples from the Small dataset for human evaluation, where two settings are reported: one includes all words (All) and the other includes only the words for which multiple (≥3) senses are sampled (Multi-Sense). There are four candidate models: all baselines, one of our best models (xSense-ATS), and xSense without alignment in (7) that jointly learns the sparse vector extractor. Three human annotators are recruited to rank the generated definitions given the target word and its corresponding contexts in each sample. Table 4 shows the final statistics, where the top-1 choices and the accumulated scores are reported (4: first, 3: second, 2: third, 1: last). Note that in some samples two models may generate exactly the same definition; if an annotator picks either of them, we assign the same score to the other.

Model                                                    | Top 1 (All) | Top 1 (Multi-Sense) | Score (All)  | Score (Multi-Sense)
Noraset et al. (2017)                                    | 311 (30.8%) | 17 (28.4%)          | 1887 (27.6%) | 111 (27.2%)
Gadetsky et al. (2018)                                   | 240 (23.8%) |  9 (15.0%)          | 1701 (24.9%) |  92 (22.5%)
xSense w/o Alignment                                     | 115 (11.4%) |  8 (13.3%)          | 1182 (17.3%) |  80 (19.9%)
xSense-ATS (Aligned Contexts/Target Word/Sense Vector)   | 342 (34.0%) | 26 (43.3%)          | 2055 (30.2%) | 124 (30.4%)

Table 4: Ranked human evaluation results on 200 randomly sampled questions from the Small dataset.

It can be found that our model performs best among all candidates in both settings, i.e., for all target words and for multi-sense target words. While Noraset et al.'s work achieves the second-best performance, it cannot distinguish different senses since it does not consider the contexts, which defeats the purpose of explaining embeddings. The multi-sense setting indeed shows that our proposed model significantly outperforms theirs. The worst model is the one without alignment, indicating that the basis vectors and the sentence embedding do not align in the vector space, so the attention cannot be correctly obtained.

Qualitative Analysis
An important capability of our model is that we can pin down the dimension in the sparse representation of a target word given its context. This is difficult to convey with numbers alone, so we show some samples for analysis in Table 5. We can see that the nearest neighbors and the generated definitions belong to the same semantic clusters. Moreover, we are able to disentangle multiple senses based on the given contexts.

To better understand the limitations of our model, we show some common mistakes in Table 6. For the word bass, our model generates the wrong definition while picking up the correct nearest neighbors. Note that the generated wrong definition is another sense of bass, so the cause of this error may be the imbalance of sense frequencies in the training data, considering that bass as a kind of fish is a relatively rare sense. For the word tie, the generated definition is correct while the selected nearest neighbors are wrong. Because the nearest neighbors are determined by (8), this error type may be propagated from the SIF sentence embedding.
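One plausible way to obtain the per-dimension nearest neighbors reported in Table 5 is to encode the whole vocabulary with the pretrained sparse extractor and list the words with the highest activation in the selected dimension; this reconstruction of the analysis procedure is an assumption, not a description of the authors' exact script.

```python
import numpy as np

def dimension_neighbors(dim, vocab, sparse_codes, top_n=3):
    """Words with the largest value in one sparse dimension.
    vocab:        list of words in V_w2v
    sparse_codes: (|V_w2v|, m) matrix of z_w vectors from Eq. (2)
    dim:          dimension index selected by the mask generator in Eq. (5)"""
    order = np.argsort(sparse_codes[:, dim])[::-1][:top_n]
    return [vocab[i] for i in order]

# For instance, dimension_neighbors(215, vocab, Z) could surface a music-related
# cluster such as the 'punk, tracklist, hiphop' neighbors shown in Table 5.
```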
Target Word: band
  Context: He looked around and saw what he was looking for, a band of thin electrical wire.
  Gen. Definition: A circular revolving plate supporting a single wire or other object of rock
  Nearest Neighbors: inductor, chipset, transceiver (701-th dimension)

  Context: In her spare time she performs as one of three vocalists in a band.
  Gen. Definition: A group of musicians actors or dancers who perform together
  Nearest Neighbors: punk, tracklist, hiphop (215-th dimension)

Target Word: cool
  Context: I closed my eyes again and imagined myself in a cool refreshing blue pool.
  Gen. Definition: soothing or refreshing because of its low temperature
  Nearest Neighbors: humid, moist, wintry (213-th dimension)

  Context: There is need to cool off our tempers and stop fanning the embers of dissent.
  Gen. Definition: unemotional undemonstrative or impassive dancers who perform together
  Nearest Neighbors: levelheaded, gentlemanly, personable (161-th dimension)

Target Word: bow
  Context: It was customary when they finished to bow as a sign of respect to their master.
  Gen. Definition: a gesture of acknowledgement or concession to
  Nearest Neighbors: palanquin, casket, limousine (143-th dimension)

  Context: Pat was wearing a black spandex long sleeved shirt with a thin thread tied in a bow.
  Gen. Definition: a length of cord rope wire or other material serving a particular purpose
  Nearest Neighbors: embroidery, ribbon, fabric (782-th dimension)

Table 5: Analysis of the generated definitions and the nearest neighbors of the single dimension with the highest value in the sparse vector.

Related Work
This work can be viewed as a bridge that connects sparse embeddings and sense embeddings for better interpretability via definition modeling.
Sparse embedding: Several works have shown that introducing sparsity in word embedding dimensions improves dimension interpretability (Murphy, Talukdar, and Mitchell 2012; Fyshe et al. 2015) and the usefulness of word embeddings as features in downstream tasks (Guo et al. 2014). These works focused on investigating the internal characteristics of word embeddings, making it hard to support real-world applications such as word sense disambiguation (WSD). In addition, they cannot provide explicit textual definitions of word embeddings.

Sense-level embedding: In the literature, most prior works assigned a vector representation to each sense of a word. They often assumed a large training corpus to facilitate the training of multi-sense embeddings in an unsupervised manner (Reisinger and Mooney 2010; Li and Jurafsky 2015; Lee and Chen 2017). Note that the sense embeddings in our framework are disentangled internally by a sparse autoencoder, so additional training data is not required. Also, unlike the prior work, our model can provide human-readable definitions for better interpretability.

Dictionary definition task: Several works have utilized dictionary definitions to perform ranking tasks or to learn word embeddings. In the ranking tasks, the models are evaluated by how well they rank words for given definitions (Hill et al. 2015) or definitions for words (Noraset et al. 2017). Aside from ranking tasks, Bahdanau et al. suggested using definitions to compute embeddings for out-of-vocabulary words. Different from these works, this paper focuses on utilizing the textual definitions to provide the capability of explaining the embeddings via human-understandable natural language.

Conclusion
In this paper, the interpretability of word embedding dimensions is investigated. Our proposed model is able to pin down a specific dimension of a word's sparse representation via an attention mechanism in an unsupervised manner and to generate the corresponding textual definition at the same time. In the experiments, the proposed model outperforms others in both the quantitative results and the human evaluation. Finally, we release a new high-quality dataset that is five times larger than the currently available one, providing potential directions for future research.
References
Arora, S.; Li, Y.; Liang, Y.; Ma, T.; and Risteski, A. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4:385–399.
Arora, S.; Li, Y.; Liang, Y.; Ma, T.; and Risteski, A. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics 6:483–495.
Arora, S.; Liang, Y.; and Ma, T. 2016. A simple but tough-to-beat baseline for sentence embeddings.
Bahdanau, D.; Bosc, T.; Jastrzbski, S.; Grefenstette, E.; Vincent, P.; and Bengio, Y. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
Faruqui, M.; Tsvetkov, Y.; Yogatama, D.; Dyer, C.; and Smith, N. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004.
Fyshe, A.; Wehbe, L.; Talukdar, P. P.; Murphy, B.; and Mitchell, T. M. 2015. A compositional and interpretable semantic space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 32–41.
Gadetsky, A.; Yakubovskiy, I.; and Vetrov, D. 2018. Conditional generators of words definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 266–271.
Guo, J.; Che, W.; Wang, H.; and Liu, T. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 110–120.
Hill, F.; Cho, K.; Korhonen, A.; and Bengio, Y. 2015. Learning to understand phrases by embedding the dictionary. arXiv preprint arXiv:1504.00548.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3294–3302.
Lee, G.-H., and Chen, Y.-N. 2017. MUSE: Modularizing unsupervised sense embeddings. arXiv preprint arXiv:1704.04601.
Li, J., and Jurafsky, D. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Lipton, Z. C. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
Makhzani, A., and Frey, B. 2013. k-sparse autoencoders. arXiv preprint arXiv:1312.5663.
Maleki, A. 2009. Coherence analysis of iterative thresholding algorithms. In 47th Annual Allerton Conference on Communication, Control, and Computing, 236–243. IEEE.
Murphy, B.; Talukdar, P.; and Mitchell, T. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. 1933–1950.
Noraset, T.; Liang, C.; Birnbaum, L.; and Downey, D. 2017. Definition modeling: Learning to define word embeddings in natural language. In Proceedings of AAAI.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
Reisinger, J., and Mooney, R. J. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 109–117.
Subramanian, A.; Pruthi, D.; Jhamtani, H.; Berg-Kirkpatrick, T.; and Hovy, E. 2017. SPINE: Sparse interpretable neural embeddings. arXiv preprint arXiv:1711.08792.