when used in a speech recognition system (Schwenk and Gauvain, 2002; Schwenk, 2004; Xu, Emami and Jelinek, 2003). In (Schwenk and Gauvain, 2002; Schwenk, 2004), it is shown how the model can be used to directly improve speech recognition performance. In (Xu, Emami and Jelinek, 2003), the approach is generalized to form the various conditional probability functions required in a stochastic parsing model called the Structured Language Model, and experiments also show that speech recognition performance can be improved over state-of-the-art alternatives.

However, a major weakness of this approach is the very long training time, as well as the large amount of computation required to compute probabilities, e.g. at the time of doing speech recognition (or any other application of the model). Note that such models could be used in other applications of statistical language modeling, such as automatic translation and information retrieval, but improving speed is important to make such applications possible.
The objective of this paper is thus to propose a much faster variant of the neural probabilistic language model. It is based on an idea that could in principle deliver close to exponential speed-up with respect to the number of words in the vocabulary. Indeed, the computations required during training and during probability prediction are a small constant plus a factor linearly proportional to the number of words |V| in the vocabulary V. The approach proposed here can yield a speed-up of order O(|V| / log |V|) for the second term. It follows up on a proposal made in (Goodman, 2001b) to rewrite a probability function based on a partition of the set of words. The basic idea is to form a hierarchical description of a word as a sequence of O(log |V|) decisions, and to learn to take these probabilistic decisions instead of directly predicting each word's probability. Another important idea of this paper is to reuse the same model (i.e. the same parameters) for all those decisions (otherwise a very large number of models would be required and the whole model would not fit in computer memory), using a special symbolic input that characterizes the nodes in the tree of the hierarchical decomposition. Finally, we use prior knowledge in the WordNet resource (Fellbaum, 1998) to help define the hierarchy of word classes.

2 THE NEURAL PROBABILISTIC LANGUAGE MODEL

The model estimates the conditional probability of the next word given its context, P(w_t | w_{t-1}, ..., w_{t-n+1}), where w_t is the word at position t in a text and w_t ∈ V, the vocabulary. The conditional probability is estimated by a normalized function f(w_t, w_{t-1}, ..., w_{t-n+1}), with Σ_v f(v, w_{t-1}, ..., w_{t-n+1}) = 1.

In (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003) this conditional probability function is represented by a neural network with a particular structure. Its most important characteristic is that each input of this function (a word symbol) is first embedded into a Euclidean space (by learning to associate a real-valued "feature vector" with each word). The set of feature vectors for all the words in V is part of the set of parameters of the model, estimated by maximizing the empirical log-likelihood (minus weight decay regularization). The idea of associating each symbol with a distributed continuous representation is not new and was advocated since the early days of neural networks (Hinton, 1986; Elman, 1990). The idea of using neural networks for language modeling is not new either (Miikkulainen and Dyer, 1991; Xu and Rudnicky, 2000), and is similar to proposals of character-based text compression using neural networks to predict the probability of the next character (Schmidhuber, 1996).

There are two variants of the model in (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003): one with |V| outputs under a softmax normalization (where the target word w_t is not mapped to a feature vector, only the context words), and one with a single output which represents the unnormalized probability of w_t given the context words. Both variants gave similar performance in the experiments reported in (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003). We start from the second variant here, which can be formalized as follows, using the Boltzmann distribution form, following (Hinton, 2000):

f(w_t, w_{t-1}, ..., w_{t-n+1}) = e^{-g(w_t, w_{t-1}, ..., w_{t-n+1})} / Σ_v e^{-g(v, w_{t-1}, ..., w_{t-n+1})}

where g(v, w_{t-1}, ..., w_{t-n+1}) is a learned function that can be interpreted as an energy, which is low when the tuple (v, w_{t-1}, ..., w_{t-n+1}) is "plausible".
Let F be an embedding matrix (a parameter), with row F_i the embedding (feature vector) of word i. The above energy function is represented by a first transformation of the input label through the feature vectors F_i, followed by an ordinary feedforward neural network (with a single output and a bias dependent on v):

g(v, w_{t-1}, ..., w_{t-n+1}) = a' · tanh(c + W x + U F'_v) + b_v    (1)

where x' denotes the transpose of x, tanh is applied element by element, a, c and b are parameter vectors, W and U are weight matrices (also parameters), and x denotes the concatenation of the input feature vectors of the context words:

x = (F_{w_{t-1}}, ..., F_{w_{t-n+1}})'.    (2)

Let h be the number of hidden units (the number of rows of W) and d the dimension of the embedding (the number of columns of F). Computing f(w_t, w_{t-1}, ..., w_{t-n+1}) can be done in two steps: first compute c + W x (which requires hd(n-1) multiply-add operations), and second, for each v ∈ V, compute U F'_v (hd multiply-add operations) and the value of g(v, ...) (h multiply-add operations). Hence the total computation time for computing f is on the order of (n-1)hd + |V|h(d+1). In the experiments reported in (Bengio et al., 2003), n is around 5, |V| is around 20000, h is around 100, and d is around 30. This gives around 12000 operations for the first part (independent of |V|) and around 60 million operations for the second part (which is linear in |V|).

Our goal in this paper is to drastically reduce the second part, ideally by replacing the O(|V|) computations by O(log |V|) computations.
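To make this cost structure concrete, here is a small NumPy sketch (ours, not the authors' implementation; the shapes follow the sizes quoted above and the parameters are random placeholders) of eqs. 1-2 and the normalization of f. The per-word loop is the part that grows linearly with |V|:

```python
import numpy as np

# Hypothetical sizes, roughly those reported in (Bengio et al., 2003).
V, h, d, n = 20000, 100, 30, 5

rng = np.random.default_rng(0)
F = rng.normal(size=(V, d))            # word feature vectors (one row per word)
W = rng.normal(size=(h, d * (n - 1)))  # hidden weights applied to the context
U = rng.normal(size=(h, d))            # hidden weights applied to the candidate word
a = rng.normal(size=h)                 # output weights
c = rng.normal(size=h)                 # hidden bias
b = rng.normal(size=V)                 # per-word output bias

def next_word_distribution(context):
    """context: list of n-1 word indices (w_{t-1}, ..., w_{t-n+1})."""
    x = np.concatenate([F[w] for w in context])   # eq. 2
    hidden_ctx = c + W @ x                        # (n-1)hd multiply-adds, done once
    # Energy g(v, context) of eq. 1 for every v: this loop is the |V|h(d+1) term.
    energies = np.array([a @ np.tanh(hidden_ctx + U @ F[v]) + b[v] for v in range(V)])
    p = np.exp(-(energies - energies.min()))      # Boltzmann form, shifted for stability
    return p / p.sum()                            # normalization over all |V| words

probs = next_word_distribution([1, 2, 3, 4])
```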
3 HIERARCHICAL DECOMPOSITION CAN PROVIDE EXPONENTIAL SPEED-UP

In (Goodman, 2001b) it is shown how to speed up a maximum entropy class-based statistical language model by using the following idea. Instead of computing directly P(Y|X) (which involves normalization across all the values that Y can take), one defines a clustering partition of the Y (into the word classes C, such that there is a deterministic function c(.) mapping Y to C), so as to write

P(Y = y | X = x) = P(Y = y | C = c(y), X) P(C = c(y) | X = x).

This is always true for any function c(.) because P(Y|X) = Σ_i P(Y, C = i | X) = Σ_i P(Y | C = i, X) P(C = i | X) = P(Y | C = c(Y), X) P(C = c(Y) | X), since only one value of C is compatible with the value of Y, namely C = c(Y).

Although any c(.) would yield correct probabilities, generalization could be better for choices of word classes that "make sense", i.e. those for which it is easier to learn P(C = c(y) | X = x). If Y can take 10000 values and we have 100 classes with 100 words y in each class, then instead of doing normalization over 10000 choices we only need to do two normalizations, each over 100 choices. If the computation of conditional probabilities is proportional to the number of choices, then the above would reduce computation by a factor of 50. This is approximately what is gained according to the measurements reported in (Goodman, 2001b). The same paper suggests that one could introduce more levels to the decomposition, and here we push this idea to the limit. Indeed, whereas a one-level decomposition should provide a speed-up on the order of |V| / √|V| = √|V|, a hierarchical decomposition represented by a balanced binary tree should provide an exponential speed-up, on the order of |V| / log_2 |V| (at least for the part of the computation that is linear in the number of choices).
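As an illustration of the two-level decomposition (our sketch, not code from (Goodman, 2001b); the score vectors standing in for the model's outputs are arbitrary), predicting a word then costs one normalization over the 100 classes plus one over the roughly 100 words of the target's class:

```python
import numpy as np

num_words, num_classes = 10000, 100
rng = np.random.default_rng(0)
word_to_class = rng.integers(num_classes, size=num_words)   # the partition c(.)

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def class_decomposed_prob(y, class_scores, word_scores):
    """P(Y=y|X=x) = P(C=c(y)|x) * P(Y=y|C=c(y), x).

    class_scores: one score per class given the context x (100 values).
    word_scores:  one score per word given the context x; only the ~100
                  entries belonging to class c(y) are normalized.
    """
    c = word_to_class[y]
    p_class = softmax(class_scores)[c]              # normalize over 100 classes
    in_class = np.flatnonzero(word_to_class == c)   # the words of class c(y)
    p_word = softmax(word_scores[in_class])[int(np.where(in_class == y)[0][0])]
    return p_class * p_word

# Example call with arbitrary scores standing in for the model's outputs.
p = class_decomposed_prob(42, rng.normal(size=num_classes), rng.normal(size=num_words))
```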
Each word v must be represented by a bit vector (b_1(v), ..., b_m(v)) (where m depends on v). This can be achieved by building a binary hierarchical clustering of words, and a method for doing so is presented in the next section. For example, b_1(v) = 1 indicates that v belongs to the top-level group 1, and b_2(v) = 0 indicates that it belongs to the sub-group 0 of that top-level group.

The next-word conditional probability can thus be represented and computed as follows:

P(v | w_{t-1}, ..., w_{t-n+1}) = ∏_{j=1}^{m} P(b_j(v) | b_1(v), ..., b_{j-1}(v), w_{t-1}, ..., w_{t-n+1}).

This can be interpreted as a series of binary stochastic decisions associated with the nodes of a binary tree. Each node is indexed by a bit vector corresponding to the path from the root to that node (append 1 or 0 according to whether the left or right branch of a decision node is followed). Each leaf corresponds to a word. If the tree is balanced then the maximum length of the bit vector is ⌈log_2 |V|⌉.
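A word's probability is therefore accumulated along its code, one binary decision per tree node. The sketch below is our own illustration; it assumes some callable next_bit_prob(node, context) returning P(b = 1 | node, context), such as the shared predictor introduced in Section 4 below:

```python
import math

def word_log_prob(bits, context, next_bit_prob):
    """log P(v | context) for a word v whose code in the binary tree is `bits`.

    bits:          the word's code (b_1(v), ..., b_m(v)), e.g. [1, 0, 1, ...]
    next_bit_prob: any model giving P(next bit = 1 | node, context); the shared
                   predictor of Section 4 is one such model (assumed, not shown here).
    """
    log_p = 0.0
    prefix = []                    # identifies the current node (path from the root)
    for b in bits:
        p1 = next_bit_prob(tuple(prefix), context)
        log_p += math.log(p1 if b == 1 else 1.0 - p1)
        prefix.append(b)           # descend to the chosen child
    return log_p

# With a balanced tree over |V| = 10000 words, len(bits) is ceil(log2(10000)) = 14,
# so only 14 predictor evaluations are needed instead of 10000.
```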
Note that we could further reduce computation by looking for an encoding that takes the frequency of words into account, so as to reduce the average bit length to the unconditional entropy of words. For example, with the corpus used in our experiments, |V| = 10000, so log_2 |V| ≈ 13.3, while the unigram entropy is about 9.16, i.e. a possible additional speed-up of 31% when taking word frequencies into account to better balance the binary tree. The gain would be greater for larger vocabularies, but not a very significant improvement over the major one obtained by using a simple balanced hierarchy.
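The 31% figure is simply the relative reduction in average code length; the small sketch below (ours; `counts` stands for a hypothetical vector of unigram counts) reproduces the arithmetic:

```python
import numpy as np

def balanced_vs_entropy_code(counts):
    """Fractional reduction in average code length when word frequencies are used."""
    p = counts / counts.sum()            # unigram distribution
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()    # lower bound on average bits per word
    balanced = np.log2(len(counts))      # bits per word with a balanced tree
    return balanced, entropy, 1.0 - entropy / balanced

# With |V| = 10000, balanced = log2(10000) ≈ 13.3 bits; a unigram entropy of
# about 9.16 bits gives a reduction of 1 - 9.16/13.3 ≈ 0.31, the 31% quoted above.
```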
The "target class" (0 or 1) for each node is obtained directly from the target word in each context, using the bit encoding of that word. Note also that there will be a target (and gradient propagation) only for the nodes on the path from the root to the leaf associated with the target word. This is the major source of savings during training.

During recognition and testing, there are two main cases to consider: either one needs the probability of only one word (e.g. the observed word, or very few), or one needs the probabilities of all the words. In the first case (which occurs during testing on a corpus) we still obtain the exponential speed-up. In the second case, we are back to O(|V|) computations (with a constant factor overhead). For the purpose of estimating generalization performance (out-of-sample log-likelihood), only the probability of the observed next word is needed. And in practical applications such as speech recognition, we are only interested in discriminating between a few alternatives, e.g. those that are consistent with the acoustics and represented in a trellis of possible word sequences.

This speed-up should be contrasted with the one provided by the importance sampling method proposed in (Bengio and Senécal, 2003). The latter method is based on the observation that the log-likelihood gradient is the average, over the model's distribution P(v|context), of the energy gradient associated with all the possible next words v. The idea is to approximate this average by a biased (but asymptotically unbiased) importance sampling scheme. This approach can lead to a significant speed-up during training, but because the architecture is unchanged, probability computation during recognition and test still requires O(|V|) computations for each prediction. Instead, the architecture proposed here gives a significant speed-up both during training and during test / recognition.

4 SHARING PARAMETERS ACROSS THE HIERARCHY

If a separate predictor is used for each of the nodes in the hierarchy, about 2|V| predictors will be needed. This represents a huge capacity, since each predictor maps from the context words to a single probability. This might create problems in terms of computer memory (not all the models would fit at the same time in memory) as well as overfitting. Therefore we have chosen to build a model in which parameters are shared across the hierarchy. There are clearly many ways to achieve such sharing, and alternatives to the architecture presented here deserve further study.

Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes. This is important also because the matrix of word features F is the largest component of the parameter set.

Since each node in the hierarchy presumably has a semantic meaning (being associated with a group of hopefully similar-meaning words), it makes sense to also associate each node with a feature vector. Without loss of generality, we can consider the model to predict P(b | node, w_{t-1}, ..., w_{t-n+1}), where node corresponds to a sequence of bits specifying a node in the hierarchy and b is the next bit (0 or 1), corresponding to one of the two children of node. This can be represented by a model similar to the one described in Section 2 and in (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003), but with two kinds of symbols in input: the context words and the current node. We allow the embedding parameters for word-cluster nodes to be different from those for words. Otherwise the architecture is the same, with the difference that there are only two choices to predict, instead of |V| choices.

More precisely, the specific predictor used in our experiments is the following:

P(b = 1 | node, w_{t-1}, ..., w_{t-n+1}) = sigmoid(α_node + β' · tanh(c + W x + U N_node))

where x is the concatenation of context word features as in eq. 2, sigmoid(y) = 1/(1 + exp(-y)), α_i is a bias parameter playing the same role as b_v in eq. 1, β is a weight vector playing the same role as a in eq. 1, c, W, U and F play the same roles as in eq. 1, and N gives feature vector embeddings for the nodes, in a way similar to how F gives feature vector embeddings for next words in eq. 1.
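A compact sketch of this shared predictor (ours, with hypothetical sizes; node_index stands for whatever integer indexing of tree nodes is used): all binary decisions reuse the same c, W, U and β, and only the node embedding N_node and the bias α_node change from node to node.

```python
import numpy as np

h, d, n = 100, 30, 5
num_nodes = 20000                        # hypothetical number of nodes in the hierarchy

rng = np.random.default_rng(0)
W = rng.normal(size=(h, d * (n - 1)))    # shared context weights (as in eq. 1)
U = rng.normal(size=(h, d))              # shared weights applied to the node embedding
c = rng.normal(size=h)                   # shared hidden bias
beta = rng.normal(size=h)                # shared output weights (role of a in eq. 1)
N = rng.normal(size=(num_nodes, d))      # one feature vector per tree node
alpha = rng.normal(size=num_nodes)       # one bias per tree node (role of b_v in eq. 1)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def p_bit_is_one(node_index, x):
    """P(b = 1 | node, context), with x the concatenated context features of eq. 2."""
    return sigmoid(alpha[node_index] + beta @ np.tanh(c + W @ x + U @ N[node_index]))

# P(v | context) is then the product of these probabilities (or their complements,
# for 0 bits) along the path from the root to the leaf of v, as in Section 3.
```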
5 USING WORDNET TO BUILD THE HIERARCHICAL DECOMPOSITION

A very important component of the whole model is the choice of the words' binary encoding, i.e. of the hierarchical word clustering. In this paper we combine empirical statistics with prior knowledge from the WordNet resource (Fellbaum, 1998). Another option would have been to use a purely data-driven hierarchical clustering of words, and there are many other ways in which the WordNet resource could have been used to influence the resulting clustering.

The IS-A taxonomy in WordNet organizes semantic concepts associated with senses in a graph that is almost a tree. For our purposes we need a tree, so we have manually selected a parent for each of the few nodes that have more than one parent. The leaves of the WordNet taxonomy are senses, and each word can be associated with more than one sense. Words sharing the same sense are considered to be synonymous (at least in one of their uses). For our purpose we have to choose one of the senses for each word (to make the whole hierarchy one over words), and we selected the most frequent sense. A straightforward extension of the proposed model would keep the semantic ambiguity of each word: each word would be associated with several leaves (senses) of the WordNet hierarchy. This would require summing over all those leaves (and the corresponding paths to the root) when computing next-word probabilities.

Note that the WordNet tree is not binary: each node may have many more than two children (this is particularly a problem for verbs and adjectives, for which WordNet is shallow and incomplete). To transform this hierarchy into a binary tree we perform a data-driven binary hierarchical clustering of the children associated with each node, as illustrated in Figure 1. The K-means algorithm is used at each step to split each cluster. To compare nodes, we associate each node with the subset of words that it covers. Each word is associated with a TF/IDF (Salton and Buckley, 1988) vector of document/word occurrence counts, where each "document" is a paragraph in the training corpus. Each node is associated with the dimension-wise median of the TF/IDF scores of the words it covers. Each TF/IDF score is the occurrence frequency of the word in the document times the logarithm of the ratio of the total number of documents to the number of documents containing the word.

Figure 1: WordNet's IS-A hierarchy is not a binary tree: most nodes have many children. Binary hierarchical clustering of these children is performed.
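One possible implementation of this binarization step (our reading of the procedure, using scikit-learn's KMeans for the two-way splits; node_vectors is assumed to map each child to its median TF/IDF vector):

```python
import numpy as np
from sklearn.cluster import KMeans

def binarize(children, node_vectors):
    """Turn a node's list of children into a binary subtree.

    children:     list of child identifiers (possibly many more than two)
    node_vectors: dict mapping each child to its median TF/IDF vector
    Returns a nested (left, right) tuple structure whose leaves are the children.
    """
    if len(children) <= 2:
        return tuple(children) if len(children) == 2 else children[0]
    X = np.stack([node_vectors[ch] for ch in children])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    left = [ch for ch, lab in zip(children, labels) if lab == 0]
    right = [ch for ch, lab in zip(children, labels) if lab == 1]
    if not left or not right:                  # degenerate split: fall back to halving
        left, right = children[: len(children) // 2], children[len(children) // 2 :]
    return (binarize(left, node_vectors), binarize(right, node_vectors))
```

Applying such a routine to every WordNet node with more than two children yields a binary tree whose leaves are words, from which the bit codes of Section 3 can be read off.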
6 COMPARATIVE RESULTS

Experiments were performed to evaluate the speed-up and any change in generalization error. The experiments also compared an alternative speed-up technique (Bengio and Senécal, 2003) that is based on importance sampling (but only provides a speed-up during training). The experiments were performed on the Brown corpus, with a reduced vocabulary size of 10,000 words (the most frequent ones). The corpus has 1,105,515 occurrences of words, split into 3 sets: 900,000 for training, 100,000 for validation (model selection), and 105,515 for testing. The validation set was used to select among a small number of choices for the size of the embeddings and the number of hidden units.

The results in terms of raw computations (time to process one example), either during training or during test, are shown respectively in Tables 1 and 2. The computations were performed on Athlon processors with a 1.2 GHz clock. The speed-up during training is by a factor greater than 250, and during test by a factor close to 200. These are impressive, but less than the |V| / log_2 |V| ≈ 750 that could be expected if there were no overhead and no constant term in the computational cost.

It is also important to verify that learning still works and that the model generalizes well. As usual in statistical language modeling, this is measured by the model's perplexity on the test data, which is the exponential of the average negative log-likelihood on that data set. Training is performed over about 20 to 30 epochs according to validation set perplexity (early stopping).
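For reference, perplexity as used here is just the exponentiated average negative log-likelihood; a minimal sketch (ours):

```python
import numpy as np

def perplexity(word_probs):
    """exp of the average negative log-likelihood over the test words."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# Sanity check: a uniform model over |V| = 10000 words has perplexity exactly 10000.
perplexity(np.full(1000, 1.0 / 10000))  # -> 10000.0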
Table 3 shows the comparative generalization performance of the different architectures, along with that of an interpolated trigram and a class-based n-gram (same procedures as in (Bengio et al., 2003), which follow respectively (Jelinek and Mercer, 1980) and (Brown et al., 1992; Ney and Kneser, 1993; Niesler, Whittaker and Woodland, 1998)). The validation set was used to choose the order of the n-gram and the number of word classes for the class-based models. We used the implementation of these algorithms in the SRI Language Modeling toolkit, described in (Stolcke, 2002) and at www.speech.sri.com/projects/srilm/. Note that better performance should be obtainable with some of the tricks in (Goodman, 2001a). Combining the neural network with a trigram should also decrease its perplexity, as already shown in (Bengio et al., 2003).

architecture          Time per epoch (s)   Time per ex. (ms)   speed-up
original neural net   416,300              462.6               1
importance sampling   6,062                6.73                68.7
hierarchical model    1,609                1.79                258

Table 1: Training time per epoch (going once through all the training examples) and per example. The original neural net is as described in Sec. 2. The importance sampling algorithm (Bengio and Senécal, 2003) trains the same model faster. The hierarchical model is the one proposed here, and it yields a speed-up not only during training but for probability predictions as well (see the next table).

architecture          Time per example (ms)   speed-up
original neural net   270.7                   1
importance sampling   221.3                   1.22
hierarchical model    1.4                     193

Table 2: Test time per example for the different algorithms. See Table 1's caption. It is at test time that the hierarchical model's advantage becomes clear in comparison to the importance sampling technique, since the latter only brings a speed-up during training.

architecture          Validation perplexity   Test perplexity
trigram               299.4                   268.7
class-based           276.4                   249.1
original neural net   213.2                   195.3
importance sampling   209.4                   192.6
hierarchical model    241.6                   220.7

Table 3: Test perplexity for the different architectures and for an interpolated trigram. The hierarchical model performed a bit worse than the original neural network, but it is still better than the baseline interpolated trigram and the class-based model.

As shown in Table 3, the hierarchical model does not generalize as well as the original neural network, but the difference is not very large and still represents an improvement over the benchmark n-gram models. Given the very large speed-up, it is certainly worth investigating variations of the hierarchical model proposed here (in particular how to define the hierarchy) for which generalization could be better. Note also that the speed-up would be greater for larger vocabularies (e.g. 50,000 words is not uncommon in speech recognition systems).

7 CONCLUSION AND FUTURE WORK

This paper proposes a novel architecture for speeding up neural networks with a huge number of output classes, and shows its usefulness in the context of statistical language modeling (which is a component of speech recognition and automatic translation systems). This work pushes to the limit a suggestion of (Goodman, 2001b), but also introduces the idea of sharing the same model for all nodes of the decomposition, which is more practical when the number of nodes is very large (tens of thousands here). The implementation and the experiments show that a very significant speed-up of around 200-fold can be achieved, with only a little degradation in generalization performance.

From a linguistic point of view, one of the weaknesses of the above model is that it considers word clusters as deterministic functions of the word, but uses the nodes in WordNet's taxonomy to help define those clusters. However, WordNet provides word sense ambiguity information which could be used for linguistically more accurate modeling. The hierarchy would be a sense hierarchy instead of a word hierarchy, and each word would be associated with a number of senses (those allowed for that word in WordNet). In computing probabilities, this would involve summing over several paths from the root, corresponding to the different possible senses of the word. As a side effect, this could provide a word sense disambiguation model, and it could be trained both on sense-tagged supervised data and on unlabeled ordinary text. Since the average number of senses per word is small (less than a handful), the loss in speed would correspondingly be small.

Acknowledgments

The authors would like to thank the following funding organizations for their support: NSERC, MITACS, IRIS, and the Canada Research Chairs.
References

Baker, D. and McCallum, A. (1998). Distributional clustering of words for text classification. In SIGIR'98.

Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model. In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems 13 (NIPS'00), pages 933-938. MIT Press.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS 2003.

Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22:39-71.

Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18:467-479.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179-211.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.

Goodman, J. (2001a). A bit of progress in language modeling. Technical Report MSR-TR-2001-72, Microsoft Research, Redmond, Washington.

Goodman, J. (2001b). Classes for fast maximum entropy training. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Utah.

Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst 1986. Lawrence Erlbaum, Hillsdale.

Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London.

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In Gelsema, E. S. and Kanal, L. N., editors, Pattern Recognition in Practice. North-Holland, Amsterdam.

Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3):400-401.

Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15:343-399.

Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 177-180.

Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183-190, Columbus, Ohio.

Salton, G. and Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523.

Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142-146.

Schutze, H. (1993). Word space. In Giles, C., Hanson, S., and Cowan, J., editors, Advances in Neural Information Processing Systems 5 (NIPS'92), pages 895-902, San Mateo CA. Morgan Kaufmann.

Schwenk, H. (2004). Efficient training of large neural networks for language modeling. In International Joint Conference on Neural Networks (IJCNN), volume 4, pages 3050-3064.

Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 765-768, Orlando, Florida.

Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Statistical Language Processing, Denver, Colorado.

Xu, P., Emami, A., and Jelinek, F. (2003). Training connectionist models for the structured language model. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP'2003), volume 10, pages 160-167.

Xu, W. and Rudnicky, A. (2000). Can artificial neural networks learn language models. In International Conference on Statistical Language Processing, pages M1-13, Beijing, China.