when used in a speech recognition system (Schwenk and Gauvain, 2002; Schwenk, 2004; Xu, Emami and Jelinek, 2003). In (Schwenk and Gauvain, 2002; Schwenk, 2004), it is shown how the model can be used to directly improve speech recognition performance. In (Xu, Emami and Jelinek, 2003), the approach is generalized to form the various conditional probability functions required in a stochastic parsing model called the Structured Language Model, and experiments also show that speech recognition performance can be improved over state-of-the-art alternatives.

However, a major weakness of this approach is the very long training time, as well as the large amount of computation required to compute probabilities, e.g. at the time of doing speech recognition (or any other application of the model). Note that such models could be used in other applications of statistical language modeling, such as automatic translation and information retrieval, but improving speed is important to make such applications possible.
The objective of this paper is thus to propose a much faster variant of the neural probabilistic language model. It is based on an idea that could in principle deliver close to exponential speed-up with respect to the number of words in the vocabulary. Indeed, the computations required during training and during probability prediction are a small constant plus a factor linearly proportional to the number of words |V| in the vocabulary V. The approach proposed here can yield a speed-up of order O(|V| / log |V|) for the second term. It follows up on a proposal made in (Goodman, 2001b) to rewrite a probability function based on a partition of the set of words. The basic idea is to form a hierarchical description of a word as a sequence of O(log |V|) decisions, and to learn to take these probabilistic decisions instead of directly predicting each word's probability. Another important idea of this paper is to reuse the same model (i.e. the same parameters) for all those decisions (otherwise a very large number of models would be required and the whole model would not fit in computer memory), using a special symbolic input that characterizes the nodes in the tree of the hierarchical decomposition. Finally, we use prior knowledge in the WordNet resource (Fellbaum, 1998) to help define the hierarchy of word classes.

2 THE NEURAL PROBABILISTIC LANGUAGE MODEL

The model estimates the conditional probability of the next word given its context, P(w_t | w_{t-1}, ..., w_{t-n+1}), where w_t is the word at position t in a text and w_t ∈ V, the vocabulary. The conditional probability is estimated by a normalized function f(w_t, w_{t-1}, ..., w_{t-n+1}), with Σ_v f(v, w_{t-1}, ..., w_{t-n+1}) = 1.

In (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003) this conditional probability function is represented by a neural network with a particular structure. Its most important characteristic is that each input of this function (a word symbol) is first embedded into a Euclidean space (by learning to associate a real-valued "feature vector" with each word). The set of feature vectors for all the words in V is part of the set of parameters of the model, estimated by maximizing the empirical log-likelihood (minus weight decay regularization). The idea of associating each symbol with a distributed continuous representation is not new and was advocated since the early days of neural networks (Hinton, 1986; Elman, 1990). The idea of using neural networks for language modeling is not new either (Miikkulainen and Dyer, 1991; Xu and Rudnicky, 2000), and is similar to proposals of character-based text compression using neural networks to predict the probability of the next character (Schmidhuber, 1996).

There are two variants of the model in (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003): one with |V| outputs under a softmax normalization (where the target word w_t is not mapped to a feature vector, only the context words), and one with a single output which represents the unnormalized probability of w_t given the context words. Both variants gave similar performance in the experiments reported in (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003). We start from the second variant here, which can be formalized as follows, using the Boltzmann distribution form, following (Hinton, 2000):

f(w_t, w_{t-1}, ..., w_{t-n+1}) = e^{-g(w_t, w_{t-1}, ..., w_{t-n+1})} / Σ_v e^{-g(v, w_{t-1}, ..., w_{t-n+1})}

where g(v, w_{t-1}, ..., w_{t-n+1}) is a learned function that can be interpreted as an energy, which is low when the tuple (v, w_{t-1}, ..., w_{t-n+1}) is "plausible".
Let F be an embedding matrix (a parameter), with row F_i the embedding (feature vector) of word i. The above energy function is represented by a first transformation of the input label through the feature vectors F_i, followed by an ordinary feedforward neural network (with a single output and a bias dependent on v):

g(v, w_{t-1}, ..., w_{t-n+1}) = a' · tanh(c + W x + U F'_v) + b_v    (1)

where x' denotes the transpose of x, tanh is applied element by element, a, c and b are parameter vectors, W and U are weight matrices (also parameters), and x denotes the concatenation of the input feature vectors of the context words:

x = (F_{w_{t-1}}, ..., F_{w_{t-n+1}})'.    (2)

Let h be the number of hidden units (the number of rows of W) and d the dimension of the embedding (the number of columns of F). Computing f(w_t, w_{t-1}, ..., w_{t-n+1}) can be done in two steps: first compute c + W x (which requires hd(n-1) multiply-add operations), and second, for each v ∈ V, compute U F'_v (hd multiply-add operations) and the value of g(v, ...) (h multiply-add operations). Hence the total computation time for computing f is on the order of (n-1)hd + |V|h(d+1). In the experiments reported in (Bengio et al., 2003), n is around 5, |V| is around 20000, h is around 100, and d is around 30. This gives around 12000 operations for the first part (independent of |V|) and around 60 million operations for the second part (which is linear in |V|).

Our goal in this paper is to drastically reduce the second part, ideally by replacing the O(|V|) computations by O(log |V|) computations.
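To make this cost structure concrete, here is a small NumPy sketch (ours, not the authors' implementation; the shapes follow the sizes quoted above and the parameters are random placeholders) of eqs. 1-2 and the normalization of f. The per-word loop is the part that grows linearly with |V|:

```python
import numpy as np

# Hypothetical sizes, roughly those reported in (Bengio et al., 2003).
V, h, d, n = 20000, 100, 30, 5

rng = np.random.default_rng(0)
F = rng.normal(size=(V, d))            # word feature vectors (one row per word)
W = rng.normal(size=(h, d * (n - 1)))  # hidden weights applied to the context
U = rng.normal(size=(h, d))            # hidden weights applied to the candidate word
a = rng.normal(size=h)                 # output weights
c = rng.normal(size=h)                 # hidden bias
b = rng.normal(size=V)                 # per-word output bias

def next_word_distribution(context):
    """context: list of n-1 word indices (w_{t-1}, ..., w_{t-n+1})."""
    x = np.concatenate([F[w] for w in context])   # eq. 2
    hidden_ctx = c + W @ x                        # (n-1)hd multiply-adds, done once
    # Energy g(v, context) of eq. 1 for every v: this loop is the |V|h(d+1) term.
    energies = np.array([a @ np.tanh(hidden_ctx + U @ F[v]) + b[v] for v in range(V)])
    p = np.exp(-(energies - energies.min()))      # Boltzmann form, shifted for stability
    return p / p.sum()                            # normalization over all |V| words

probs = next_word_distribution([1, 2, 3, 4])
```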
3 HIERARCHICAL DECOMPOSITION CAN PROVIDE EXPONENTIAL SPEED-UP

In (Goodman, 2001b) it is shown how to speed up a maximum entropy class-based statistical language model by using the following idea. Instead of computing directly P(Y|X) (which involves normalization across all the values that Y can take), one defines a clustering partition of the Y (into the word classes C, such that there is a deterministic function c(.) mapping Y to C), so as to write

P(Y = y | X = x) = P(Y = y | C = c(y), X) P(C = c(y) | X = x).

This is always true for any function c(.) because P(Y|X) = Σ_i P(Y, C = i | X) = Σ_i P(Y | C = i, X) P(C = i | X) = P(Y | C = c(Y), X) P(C = c(Y) | X), since only one value of C is compatible with the value of Y, namely C = c(Y).

Although any c(.) would yield correct probabilities, generalization could be better for choices of word classes that "make sense", i.e. those for which it is easier to learn P(C = c(y) | X = x). If Y can take 10000 values and we have 100 classes with 100 words y in each class, then instead of doing normalization over 10000 choices we only need to do two normalizations, each over 100 choices. If the computation of conditional probabilities is proportional to the number of choices, then the above would reduce computation by a factor of 50. This is approximately what is gained according to the measurements reported in (Goodman, 2001b). The same paper suggests that one could introduce more levels to the decomposition, and here we push this idea to the limit. Indeed, whereas a one-level decomposition should provide a speed-up on the order of |V| / √|V| = √|V|, a hierarchical decomposition represented by a balanced binary tree should provide an exponential speed-up, on the order of |V| / log_2 |V| (at least for the part of the computation that is linear in the number of choices).
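As an illustration of the two-level decomposition (our sketch, not code from (Goodman, 2001b); the score vectors standing in for the model's outputs are arbitrary), predicting a word then costs one normalization over the 100 classes plus one over the roughly 100 words of the target's class:

```python
import numpy as np

num_words, num_classes = 10000, 100
rng = np.random.default_rng(0)
word_to_class = rng.integers(num_classes, size=num_words)   # the partition c(.)

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def class_decomposed_prob(y, class_scores, word_scores):
    """P(Y=y|X=x) = P(C=c(y)|x) * P(Y=y|C=c(y), x).

    class_scores: one score per class given the context x (100 values).
    word_scores:  one score per word given the context x; only the ~100
                  entries belonging to class c(y) are normalized.
    """
    c = word_to_class[y]
    p_class = softmax(class_scores)[c]              # normalize over 100 classes
    in_class = np.flatnonzero(word_to_class == c)   # the words of class c(y)
    p_word = softmax(word_scores[in_class])[int(np.where(in_class == y)[0][0])]
    return p_class * p_word

# Example call with arbitrary scores standing in for the model's outputs.
p = class_decomposed_prob(42, rng.normal(size=num_classes), rng.normal(size=num_words))
```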
Each word v must be represented by a bit vector (b_1(v), ..., b_m(v)) (where m depends on v). This can be achieved by building a binary hierarchical clustering of words, and a method for doing so is presented in the next section. For example, b_1(v) = 1 indicates that v belongs to the top-level group 1, and b_2(v) = 0 indicates that it belongs to the sub-group 0 of that top-level group.

The next-word conditional probability can thus be represented and computed as follows:

P(v | w_{t-1}, ..., w_{t-n+1}) = ∏_{j=1}^{m} P(b_j(v) | b_1(v), ..., b_{j-1}(v), w_{t-1}, ..., w_{t-n+1}).

This can be interpreted as a series of binary stochastic decisions associated with the nodes of a binary tree. Each node is indexed by a bit vector corresponding to the path from the root to that node (append 1 or 0 according to whether the left or right branch of a decision node is followed). Each leaf corresponds to a word. If the tree is balanced then the maximum length of the bit vector is ⌈log_2 |V|⌉.
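A word's probability is therefore accumulated along its code, one binary decision per tree node. The sketch below is our own illustration; it assumes some callable next_bit_prob(node, context) returning P(b = 1 | node, context), such as the shared predictor introduced in Section 4 below:

```python
import math

def word_log_prob(bits, context, next_bit_prob):
    """log P(v | context) for a word v whose code in the binary tree is `bits`.

    bits:          the word's code (b_1(v), ..., b_m(v)), e.g. [1, 0, 1, ...]
    next_bit_prob: any model giving P(next bit = 1 | node, context); the shared
                   predictor of Section 4 is one such model (assumed, not shown here).
    """
    log_p = 0.0
    prefix = []                    # identifies the current node (path from the root)
    for b in bits:
        p1 = next_bit_prob(tuple(prefix), context)
        log_p += math.log(p1 if b == 1 else 1.0 - p1)
        prefix.append(b)           # descend to the chosen child
    return log_p

# With a balanced tree over |V| = 10000 words, len(bits) is ceil(log2(10000)) = 14,
# so only 14 predictor evaluations are needed instead of 10000.
```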
Note that we could further reduce computation by looking for an encoding that takes the frequency of words into account, so as to reduce the average bit length to the unconditional entropy of words. For example, with the corpus used in our experiments, |V| = 10000, so log_2 |V| ≈ 13.3, while the unigram entropy is about 9.16, i.e. a possible additional speed-up of 31% when taking word frequencies into account to better balance the binary tree. The gain would be greater for larger vocabularies, but not a very significant improvement over the major one obtained by using a simple balanced hierarchy.
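The 31% figure is simply the relative reduction in average code length; the small sketch below (ours; `counts` stands for a hypothetical vector of unigram counts) reproduces the arithmetic:

```python
import numpy as np

def balanced_vs_entropy_code(counts):
    """Fractional reduction in average code length when word frequencies are used."""
    p = counts / counts.sum()            # unigram distribution
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()    # lower bound on average bits per word
    balanced = np.log2(len(counts))      # bits per word with a balanced tree
    return balanced, entropy, 1.0 - entropy / balanced

# With |V| = 10000, balanced = log2(10000) ≈ 13.3 bits; a unigram entropy of
# about 9.16 bits gives a reduction of 1 - 9.16/13.3 ≈ 0.31, the 31% quoted above.
```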
The "target class" (0 or 1) for each node is obtained directly from the target word in each context, using the bit encoding of that word. Note also that there will be a target (and gradient propagation) only for the nodes on the path from the root to the leaf associated with the target word. This is the major source of savings during training.

During recognition and testing, there are two main cases to consider: either one needs the probability of only one word (e.g. the observed word, or very few), or one needs the probabilities of all the words. In the first case (which occurs during testing on a corpus) we still obtain the exponential speed-up. In the second case, we are back to O(|V|) computations (with a constant factor overhead). For the purpose of estimating generalization performance (out-of-sample log-likelihood), only the probability of the observed next word is needed. And in practical applications such as speech recognition, we are only interested in discriminating between a few alternatives, e.g. those that are consistent with the acoustics and represented in a trellis of possible word sequences.

This speed-up should be contrasted with the one provided by the importance sampling method proposed in (Bengio and Senécal, 2003). The latter method is based on the observation that the log-likelihood gradient is the average, over the model's distribution P(v|context), of the energy gradient associated with all the possible next words v. The idea is to approximate this average by a biased (but asymptotically unbiased) importance sampling scheme. This approach can lead to a significant speed-up during training, but because the architecture is unchanged, probability computation during recognition and test still requires O(|V|) computations for each prediction. Instead, the architecture proposed here gives a significant speed-up both during training and during test / recognition.

4 SHARING PARAMETERS ACROSS THE HIERARCHY

If a separate predictor is used for each of the nodes in the hierarchy, about 2|V| predictors will be needed. This represents a huge capacity, since each predictor maps from the context words to a single probability. This might create problems in terms of computer memory (not all the models would fit at the same time in memory) as well as overfitting. Therefore we have chosen to build a model in which parameters are shared across the hierarchy. There are clearly many ways to achieve such sharing, and alternatives to the architecture presented here deserve further study.

Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes. This is important also because the matrix of word features F is the largest component of the parameter set.

Since each node in the hierarchy presumably has a semantic meaning (being associated with a group of hopefully similar-meaning words), it makes sense to also associate each node with a feature vector. Without loss of generality, we can consider the model to predict P(b | node, w_{t-1}, ..., w_{t-n+1}), where node corresponds to a sequence of bits specifying a node in the hierarchy and b is the next bit (0 or 1), corresponding to one of the two children of node. This can be represented by a model similar to the one described in Section 2 and in (Bengio, Ducharme and Vincent, 2001; Bengio et al., 2003), but with two kinds of symbols in input: the context words and the current node. We allow the embedding parameters for word-cluster nodes to be different from those for words. Otherwise the architecture is the same, with the difference that there are only two choices to predict, instead of |V| choices.

More precisely, the specific predictor used in our experiments is the following:

P(b = 1 | node, w_{t-1}, ..., w_{t-n+1}) = sigmoid(α_node + β' · tanh(c + W x + U N_node))

where x is the concatenation of context word features as in eq. 2, sigmoid(y) = 1/(1 + exp(-y)), α_i is a bias parameter playing the same role as b_v in eq. 1, β is a weight vector playing the same role as a in eq. 1, c, W, U and F play the same roles as in eq. 1, and N gives feature vector embeddings for the nodes, in a way similar to how F gives feature vector embeddings for next words in eq. 1.
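A compact sketch of this shared predictor (ours, with hypothetical sizes; node_index stands for whatever integer indexing of tree nodes is used): all binary decisions reuse the same c, W, U and β, and only the node embedding N_node and the bias α_node change from node to node.

```python
import numpy as np

h, d, n = 100, 30, 5
num_nodes = 20000                        # hypothetical number of nodes in the hierarchy

rng = np.random.default_rng(0)
W = rng.normal(size=(h, d * (n - 1)))    # shared context weights (as in eq. 1)
U = rng.normal(size=(h, d))              # shared weights applied to the node embedding
c = rng.normal(size=h)                   # shared hidden bias
beta = rng.normal(size=h)                # shared output weights (role of a in eq. 1)
N = rng.normal(size=(num_nodes, d))      # one feature vector per tree node
alpha = rng.normal(size=num_nodes)       # one bias per tree node (role of b_v in eq. 1)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def p_bit_is_one(node_index, x):
    """P(b = 1 | node, context), with x the concatenated context features of eq. 2."""
    return sigmoid(alpha[node_index] + beta @ np.tanh(c + W @ x + U @ N[node_index]))

# P(v | context) is then the product of these probabilities (or their complements,
# for 0 bits) along the path from the root to the leaf of v, as in Section 3.
```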
5 USING WORDNET TO BUILD THE HIERARCHICAL DECOMPOSITION

A very important component of the whole model is the choice of the words' binary encoding, i.e. of the hierarchical word clustering. In this paper we combine empirical statistics with prior knowledge from the WordNet resource (Fellbaum, 1998). Another option would have been to use a purely data-driven hierarchical clustering of words, and there are many other ways in which the WordNet resource could have been used to influence the resulting clustering.

The IS-A taxonomy in WordNet organizes semantic concepts associated with senses in a graph that is almost a tree. For our purposes we need a tree, so we have manually selected a parent for each of the few nodes that have more than one parent. The leaves of the WordNet taxonomy are senses, and each word can be associated with more than one sense. Words sharing the same sense are considered to be synonymous (at least in one of their uses). For our purpose we have to choose one of the senses for each word (to make the whole hierarchy one over words), and we selected the most frequent sense. A straightforward extension of the proposed model would keep the semantic ambiguity of each word: each word would be associated with several leaves (senses) of the WordNet hierarchy. This would require summing over all those leaves (and the corresponding paths to the root) when computing next-word probabilities.

Note that the WordNet tree is not binary: each node may have many more than two children (this is particularly a problem for verbs and adjectives, for which WordNet is shallow and incomplete). To transform this hierarchy into a binary tree we perform a data-driven binary hierarchical clustering of the children associated with each node, as illustrated in Figure 1. The K-means algorithm is used at each step to split each cluster. To compare nodes, we associate each node with the subset of words that it covers. Each word is associated with a TF/IDF (Salton and Buckley, 1988) vector of document/word occurrence counts, where each "document" is a paragraph in the training corpus. Each node is associated with the dimension-wise median of the TF/IDF scores of the words it covers. Each TF/IDF score is the occurrence frequency of the word in the document times the logarithm of the ratio of the total number of documents to the number of documents containing the word.

Figure 1: WordNet's IS-A hierarchy is not a binary tree: most nodes have many children. Binary hierarchical clustering of these children is performed.
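One possible implementation of this binarization step (our reading of the procedure, using scikit-learn's KMeans for the two-way splits; node_vectors is assumed to map each child to its median TF/IDF vector):

```python
import numpy as np
from sklearn.cluster import KMeans

def binarize(children, node_vectors):
    """Turn a node's list of children into a binary subtree.

    children:     list of child identifiers (possibly many more than two)
    node_vectors: dict mapping each child to its median TF/IDF vector
    Returns a nested (left, right) tuple structure whose leaves are the children.
    """
    if len(children) <= 2:
        return tuple(children) if len(children) == 2 else children[0]
    X = np.stack([node_vectors[ch] for ch in children])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    left = [ch for ch, lab in zip(children, labels) if lab == 0]
    right = [ch for ch, lab in zip(children, labels) if lab == 1]
    if not left or not right:                  # degenerate split: fall back to halving
        left, right = children[: len(children) // 2], children[len(children) // 2 :]
    return (binarize(left, node_vectors), binarize(right, node_vectors))
```

Applying such a routine to every WordNet node with more than two children yields a binary tree whose leaves are words, from which the bit codes of Section 3 can be read off.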
6 COMPARATIVE RESULTS

Experiments were performed to evaluate the speed-up and any change in generalization error. The experiments also compared an alternative speed-up technique (Bengio and Senécal, 2003) that is based on importance sampling (but only provides a speed-up during training). The experiments were performed on the Brown corpus, with a reduced vocabulary size of 10,000 words (the most frequent ones). The corpus has 1,105,515 occurrences of words, split into 3 sets: 900,000 for training, 100,000 for validation (model selection), and 105,515 for testing. The validation set was used to select among a small number of choices for the size of the embeddings and the number of hidden units.

The results in terms of raw computations (time to process one example), either during training or during test, are shown respectively in Tables 1 and 2. The computations were performed on Athlon processors with a 1.2 GHz clock. The speed-up during training is by a factor greater than 250, and during test by a factor close to 200. These are impressive, but less than the |V| / log_2 |V| ≈ 750 that could be expected if there were no overhead and no constant term in the computational cost.

It is also important to verify that learning still works and that the model generalizes well. As usual in statistical language modeling, this is measured by the model's perplexity on the test data, which is the exponential of the average negative log-likelihood on that data set. Training is performed over about 20 to 30 epochs according to validation set perplexity (early stopping).
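For reference, perplexity as used here is just the exponentiated average negative log-likelihood; a minimal sketch (ours):

```python
import numpy as np

def perplexity(word_probs):
    """exp of the average negative log-likelihood over the test words."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# Sanity check: a uniform model over |V| = 10000 words has perplexity exactly 10000.
perplexity(np.full(1000, 1.0 / 10000))  # -> 10000.0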
Table 3 shows the comparative generalization performance of the different architectures, along with that of an interpolated trigram and a class-based n-gram (same procedures as in (Bengio et al., 2003), which follow respectively (Jelinek and Mercer, 1980) and (Brown et al., 1992; Ney and Kneser, 1993; Niesler, Whittaker and Woodland, 1998)). The validation set was used to choose the order of the n-gram and the number of word classes for the class-based models. We used the implementation of these algorithms in the SRI Language Modeling toolkit, described in (Stolcke, 2002) and at www.speech.sri.com/projects/srilm/. Note that better performance should be obtainable with some of the tricks in (Goodman, 2001a). Combining the neural network with a trigram should also decrease its perplexity, as already shown in (Bengio et al., 2003).

architecture          Time per epoch (s)   Time per ex. (ms)   speed-up
original neural net   416,300              462.6               1
importance sampling   6,062                6.73                68.7
hierarchical model    1,609                1.79                258

Table 1: Training time per epoch (going once through all the training examples) and per example. The original neural net is as described in Sec. 2. The importance sampling algorithm (Bengio and Senécal, 2003) trains the same model faster. The hierarchical model is the one proposed here, and it yields a speed-up not only during training but for probability predictions as well (see the next table).

architecture          Time per example (ms)   speed-up
original neural net   270.7                   1
importance sampling   221.3                   1.22
hierarchical model    1.4                     193

Table 2: Test time per example for the different algorithms. See Table 1's caption. It is at test time that the hierarchical model's advantage becomes clear in comparison to the importance sampling technique, since the latter only brings a speed-up during training.

architecture          Validation perplexity   Test perplexity
trigram               299.4                   268.7
class-based           276.4                   249.1
original neural net   213.2                   195.3
importance sampling   209.4                   192.6
hierarchical model    241.6                   220.7

Table 3: Test perplexity for the different architectures and for an interpolated trigram. The hierarchical model performed a bit worse than the original neural network, but it is still better than the baseline interpolated trigram and the class-based model.

As shown in Table 3, the hierarchical model does not generalize as well as the original neural network, but the difference is not very large and still represents an improvement over the benchmark n-gram models. Given the very large speed-up, it is certainly worth investigating variations of the hierarchical model proposed here (in particular how to define the hierarchy) for which generalization could be better. Note also that the speed-up would be greater for larger vocabularies (e.g. 50,000 words is not uncommon in speech recognition systems).

7 CONCLUSION AND FUTURE WORK

This paper proposes a novel architecture for speeding up neural networks with a huge number of output classes, and shows its usefulness in the context of statistical language modeling (which is a component of speech recognition and automatic translation systems). This work pushes to the limit a suggestion of (Goodman, 2001b), but also introduces the idea of sharing the same model for all nodes of the decomposition, which is more practical when the number of nodes is very large (tens of thousands here). The implementation and the experiments show that a very significant speed-up of around 200-fold can be achieved, with only a little degradation in generalization performance.

From a linguistic point of view, one of the weaknesses of the above model is that it considers word clusters as deterministic functions of the word, but uses the nodes in WordNet's taxonomy to help define those clusters. However, WordNet provides word sense ambiguity information which could be used for linguistically more accurate modeling. The hierarchy would be a sense hierarchy instead of a word hierarchy, and each word would be associated with a number of senses (those allowed for that word in WordNet). In computing probabilities, this would involve summing over several paths from the root, corresponding to the different possible senses of the word. As a side effect, this could provide a word sense disambiguation model, and it could be trained both on sense-tagged supervised data and on unlabeled ordinary text. Since the average number of senses per word is small (less than a handful), the loss in speed would correspondingly be small.

Acknowledgments

The authors would like to thank the following funding organizations for their support: NSERC, MITACS, IRIS, and the Canada Research Chairs.
References

Baker, D. and McCallum, A. (1998). Distributional clustering of words for text classification. In SIGIR'98.

Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model. In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems 13 (NIPS'00), pages 933-938. MIT Press.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS 2003.

Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22:39-71.

Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18:467-479.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179-211.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.

Goodman, J. (2001a). A bit of progress in language modeling. Technical Report MSR-TR-2001-72, Microsoft Research, Redmond, Washington.

Goodman, J. (2001b). Classes for fast maximum entropy training. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Utah.

Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst 1986. Lawrence Erlbaum, Hillsdale.

Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London.

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In Gelsema, E. S. and Kanal, L. N., editors, Pattern Recognition in Practice. North-Holland, Amsterdam.

Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3):400-401.

Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15:343-399.

Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 177-180.

Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183-190, Columbus, Ohio.

Salton, G. and Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523.

Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142-146.

Schutze, H. (1993). Word space. In Giles, C., Hanson, S., and Cowan, J., editors, Advances in Neural Information Processing Systems 5 (NIPS'92), pages 895-902, San Mateo CA. Morgan Kaufmann.

Schwenk, H. (2004). Efficient training of large neural networks for language modeling. In International Joint Conference on Neural Networks (IJCNN), volume 4, pages 3050-3064.

Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 765-768, Orlando, Florida.

Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Statistical Language Processing, Denver, Colorado.

Xu, P., Emami, A., and Jelinek, F. (2003). Training connectionist models for the structured language model. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP'2003), volume 10, pages 160-167.

Xu, W. and Rudnicky, A. (2000). Can artificial neural networks learn language models. In International Conference on Statistical Language Processing, pages M1-13, Beijing, China.