ingly applied to sentence modeling. For example, Tai et al. (2015) adjusted the standard LSTM to tree-structured topologies and obtained superior results over a sequential LSTM on related tasks.

In this paper, we stack CNN and LSTM in a unified architecture for semantic sentence modeling. The combination of CNN and LSTM can be seen in some computer vision tasks such as image captioning (Xu et al., 2015) and speech recognition (Sainath et al., 2015). Most of these models use multi-layer CNNs and either train the CNN and RNN separately or feed the output of a fully connected CNN layer into the RNN as input. Our approach is different: we apply the CNN to text data and feed consecutive window features directly to the LSTM, so that our architecture enables the LSTM to learn long-range dependencies from higher-order sequential features. In (Li et al., 2015), the authors suggest that sequence-based models are sufficient to capture the compositional semantics for many NLP tasks; thus, in this work the CNN is built directly upon word sequences rather than the syntactic parse tree. Our experiments on sentiment classification and 6-way question classification tasks clearly demonstrate the superiority of our model over single CNN or LSTM models and other related sequence-based models.

3 C-LSTM Model

Figure 1: The architecture of C-LSTM for sentence modeling. Blocks of the same color in the feature map layer and window feature sequence layer correspond to features for the same window. The dashed lines connect the feature of a window with the source feature map. The final output of the entire model is the last hidden unit of the LSTM.

3.1 N-gram Feature Extraction through Convolution

The one-dimensional convolution involves a filter vector sliding over a sequence and detecting features at different positions. Let x_i ∈ R^d be the d-dimensional word vector for the i-th word in a sentence, and let x ∈ R^{L×d} denote the input sentence, where L is the length of the sentence. Let k be the length of the filter, and let the vector m ∈ R^{k×d} be a filter for the convolution operation. For each position j in the sentence, we have a window vector w_j consisting of k consecutive word vectors, denoted as:

w_j = [x_j, x_{j+1}, · · · , x_{j+k−1}]    (1)

Here, the commas represent row vector concatenation. A filter m convolves with the window vectors (k-grams) at each position in a valid way to generate a feature map c ∈ R^{L−k+1}; each element c_j of the feature map for window vector w_j is produced as follows:
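As a rough illustration of this n-gram feature extraction step, the NumPy sketch below builds the window vectors w_j and one feature map c for a single filter m; the ReLU nonlinearity and the bias term are assumptions added to make the example runnable, not details taken from the text above.

```python
import numpy as np

def ngram_feature_map(x, m, b=0.0):
    """Slide a filter of length k over a sentence matrix x (L x d) and
    return a feature map c of length L - k + 1 ('valid' convolution).

    The ReLU nonlinearity and bias b are assumptions for illustration."""
    L, d = x.shape
    k = m.shape[0]
    c = np.empty(L - k + 1)
    for j in range(L - k + 1):
        w_j = x[j:j + k]                        # window of k consecutive word vectors
        c[j] = max(0.0, np.sum(w_j * m) + b)    # element-wise product, summed, then ReLU
    return c

# Toy usage: a sentence of L=7 words with d=4 dimensional embeddings and a k=3 filter.
x = np.random.randn(7, 4)
m = np.random.randn(3, 4)
print(ngram_feature_map(x, m).shape)            # (5,) == L - k + 1
```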
features after convolution, and the sequence of window representations is fed into the LSTM. For the latter case, since the number of windows generated by each convolution layer varies with the filter length (see L − k + 1 below Equation (3)), we cut the window sequence at the end based on the maximum filter length, which gives the shortest number of windows. Each window is then represented as the concatenation of the outputs from the different convolutional layers. We also explore different combinations of filter lengths, and we present an experimental analysis of this filter size exploration later. According to these experiments, we choose a single convolutional layer with filter length 3.
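As a hedged sketch of the truncation just described (the function name is illustrative, not from the paper), feature maps produced by parallel convolution layers with different filter lengths can be cut to the shortest window count, L − k_max + 1, and then concatenated per window position:

```python
import numpy as np

def align_and_concat(feature_maps, filter_lengths, L):
    """Cut each feature map (one per parallel convolution layer, each of
    shape (L - k + 1, n_filters)) to the shortest window count,
    L - max(k) + 1, then concatenate per window position."""
    n_windows = L - max(filter_lengths) + 1
    trimmed = [fm[:n_windows] for fm in feature_maps]   # drop trailing windows
    return np.concatenate(trimmed, axis=1)              # (n_windows, total filters)

# Toy example: L = 10, parallel layers with filter lengths 2 and 3, 4 filters each.
L = 10
maps = [np.random.randn(L - k + 1, 4) for k in (2, 3)]
print(align_and_concat(maps, (2, 3), L).shape)          # (8, 8)
```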
For SST, the number of filters of length 3 is set to 150 and the memory dimension of the LSTM is also set to 150. The word vector layer and the LSTM layer are dropped out with a probability of 0.5. For TREC, the number of filters is set to 300 and the memory dimension is set to 300; the word vector layer and the LSTM layer are again dropped out with a probability of 0.5. We also add L2 regularization with a factor of 0.001 to the weights in the softmax layer for both tasks.
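Putting these settings together, here is a minimal PyTorch sketch of the SST configuration (one convolution layer with filter length 3 and 150 filters, LSTM memory dimension 150, dropout 0.5 on the word-vector and LSTM layers, softmax on top). The class name, the 300-dimensional embeddings, the ReLU nonlinearity, the 5-class output, and the use of PyTorch itself are assumptions for illustration, not the paper's Theano implementation.

```python
import torch
import torch.nn as nn

class CLSTM(nn.Module):
    """Hypothetical sketch of the C-LSTM configuration described above."""

    def __init__(self, embed_dim=300, n_filters=150, filter_len=3,
                 lstm_dim=150, num_classes=5, dropout=0.5):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        # 'Valid' 1-D convolution over the word-vector sequence.
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=filter_len)
        self.lstm = nn.LSTM(n_filters, lstm_dim, batch_first=True)
        self.fc = nn.Linear(lstm_dim, num_classes)

    def forward(self, x):                               # x: (batch, L, embed_dim)
        x = self.drop(x)                                # dropout on the word vector layer
        c = torch.relu(self.conv(x.transpose(1, 2)))    # (batch, n_filters, L - k + 1)
        feats = c.transpose(1, 2)                       # window feature sequence for the LSTM
        _, (h_n, _) = self.lstm(feats)
        h = self.drop(h_n[-1])                          # last hidden state, dropped out
        return self.fc(h)                               # logits; softmax applied in the loss

# Toy usage: batch of 2 sentences, 20 words each, 300-d embeddings assumed.
logits = CLSTM()(torch.randn(2, 20, 300))
print(logits.shape)                                     # torch.Size([2, 5])
```

Using the last hidden state of the LSTM as the sentence representation matches the description in the Figure 1 caption.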
6 Results and Model Analysis

In this section, we present our evaluation results on the sentiment classification and question type classification tasks. We also give some model analysis of the filter size configuration.

6.1 Sentiment Classification

The results are shown in Table 1. We compare our model with a large set of well-performing models on the Stanford Sentiment Treebank.

Generally, the baseline models consist of recursive models, convolutional neural network models, LSTM-related models and others. The recursive models employ a syntactic parse tree as the sentence structure, and the sentence representation is computed recursively in a bottom-up manner along the parse tree. Under this category, we choose the recursive autoencoder (RAE), matrix-vector (MV-RNN), tensor-based composition (RNTN) and multi-layer stacked (DRNN) recursive neural networks as baselines. Among CNNs, we compare with Kim's (2014) CNN model with fine-tuned word vectors (CNN-non-static) and multi-channels (CNN-multichannel), DCNN with dynamic k-max pooling, and Tao's CNN (Molding-CNN) with low-rank tensor-based non-linear and non-consecutive convolutions. Among LSTM-related models, we first compare with two tree-structured LSTM models (Dependency Tree-LSTM and Constituency Tree-LSTM) that adjust the LSTM to tree-structured network topologies. We then implement a one-layer LSTM and a Bi-LSTM ourselves. Since we could not tune the Bi-LSTM to match the result reported in (Tai et al., 2015), even when following their untied weight configuration, we report our own results. For the other baseline methods, we compare against an SVM with unigram and bigram features, NBoW with averaged word vector features, and the paragraph vector, which infers a new paragraph vector for unseen documents.

To the best of our knowledge, we achieve the fourth best published result for the 5-class classification task on this dataset. For the binary classification task, we achieve results comparable to the state of the art. From Table 1, we have the following observations: (1) Although we did not beat the state-of-the-art results, as an end-to-end model our result is still promising and comparable with those of models that heavily rely on linguistic annotations and knowledge, especially syntactic parse trees. This indicates that C-LSTM will be more feasible for various scenarios. (2) Comparing our results against the single CNN and LSTM models shows that the LSTM does learn long-term dependencies across sequences of higher-level representations better. In the future we could explore how to learn more compact higher-level representations by replacing the standard convolution with other non-linear feature mapping functions or by appealing to tree-structured topologies before the convolutional layer.
Model              Acc (%)   Reported in
SVM                95.0      Silva et al. (2011)
Paragraph Vector   91.8      Zhao et al. (2015)
Ada-CNN            92.4      Zhao et al. (2015)
CNN-non-static     93.6      Kim (2014)
CNN-multichannel   92.2      Kim (2014)
DCNN               93.0      Kalchbrenner et al. (2014)
LSTM               93.2      our implementation
Bi-LSTM            93.0      our implementation
C-LSTM             94.6      our implementation

Table 2: The 6-way question type classification accuracy on TREC.
6.2 Question Type Classification

The prediction accuracy on TREC question classification is reported in Table 2. We compare our model with a variety of models. The SVM classifier uses unigrams, bigrams, wh-words, head words, POS tags, parser output, hypernyms and WordNet synsets as engineered features, together with 60 hand-coded rules. Ada-CNN is a self-adaptive hierarchical sentence model with gating networks. The other baseline models were introduced for the previous task. From Table 2, we have the following observations: (1) Our result consistently outperforms all published neural baseline models, which means that C-LSTM captures the intentions of TREC questions well. (2) Our result is close to that of the state-of-the-art SVM, which depends on highly engineered features. Such engineered features not only demand human labor but also lead to error propagation from the existing NLP tools, and thus do not generalize well to other datasets and tasks. With its ability to automatically learn semantic sentence representations, C-LSTM does not require any human-designed features and has better scalability.

6.3 Model Analysis

Here we investigate the impact of different filter configurations in the convolutional layer on model performance.

In the convolutional layer of our model, filters are used to capture local n-gram features. Intuitively, multiple convolutional layers in parallel with different filter sizes should perform better than a single convolutional layer with filters of the same length, since different filter sizes can exploit features of different n-grams. However, we found in our experiments that a single convolutional layer with filter length 3 always outperforms the other cases.
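To make the "multiple convolutional layers in parallel" configuration concrete, the hedged PyTorch sketch below (class and parameter names are illustrative, not from the paper) applies parallel convolutions with different filter lengths, trims their outputs to the shortest window count as described earlier, and concatenates them per window before they would be fed to the LSTM:

```python
import torch
import torch.nn as nn

class ParallelConvWindows(nn.Module):
    """Illustrative 'parallel filters' variant: 1-D convolutions with different
    filter lengths; outputs are trimmed to the shortest window count and
    concatenated per window before being fed to an LSTM."""

    def __init__(self, embed_dim=300, n_filters=150, filter_lens=(2, 3)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in filter_lens])
        self.max_k = max(filter_lens)                   # longest filter decides the cut

    def forward(self, x):                               # x: (batch, L, embed_dim)
        n = x.size(1) - self.max_k + 1                  # shortest window count
        maps = [torch.relu(conv(x.transpose(1, 2)))[:, :, :n] for conv in self.convs]
        return torch.cat(maps, dim=1).transpose(1, 2)   # (batch, n, total filters)

# Toy usage: window features for a batch of 2 sentences of length 10.
feats = ParallelConvWindows()(torch.randn(2, 10, 300))
print(feats.shape)                                      # torch.Size([2, 8, 300])
```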
We show in Figure 2 the prediction accuracies on the 6-way question classification task using different filter configurations. Note that we also observe a similar phenomenon on the sentiment classification task. For each filter configuration, we report in Figure 2 the best result under an extensive grid search over hyperparameters. It is shown that a single convolutional layer with filter length 3 performs best among all filter configurations. For the case of multiple convolutional layers in parallel, filter configurations that include filter length 3 perform better than those without tri-gram filters, which further confirms that tri-gram features play a significant role in capturing local features in our tasks. We conjecture that the LSTM learns better semantic sentence representations from sequences of tri-gram features.

Figure 2: Prediction accuracies on TREC questions with different filter size strategies. For the horizontal axis, S means a single convolutional layer with the same filter length, and M means multiple convolutional layers in parallel with different filter lengths.

7 Conclusion and Future Work

We have described a novel, unified model called C-LSTM that combines a convolutional neural network with a long short-term memory network (LSTM). C-LSTM is able to learn phrase-level features through a convolutional layer; sequences of such higher-level representations are then fed into the LSTM to learn long-term dependencies. We evaluated the learned semantic sentence representations on sentiment classification and question type classification tasks.

References

[Bastien et al.2012] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Denil et al.2014] Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. 2014. Modelling, visualising and summarising documents with a single convolutional neural network. arXiv preprint arXiv:1406.3830.

[Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1370–1380.

[Hinton et al.2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. The Computing Research Repository (CoRR).

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
[Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pages 2096–2104.

[Johnson and Zhang2015] Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 103–112.

[Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. Association for Computational Linguistics (ACL).

[Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods on Natural Language Processing.

[Le and Mikolov2014] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

[Lei et al.2015] Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: non-linear, non-consecutive convolutions. In Proceedings of Empirical Methods on Natural Language Processing.

[Li and Roth2002] Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7. Association for Computational Linguistics.

[Li et al.2015] Jiwei Li, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of Empirical Methods on Natural Language Processing.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Mou et al.2015] Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Discriminative neural sentence modeling by tree-based convolution. Unpublished manuscript: http://arxiv.org/abs/1504.01106v5, version 5.

[Nair and Hinton2010] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

[Pascanu et al.2014] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to construct deep recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).

[Sainath et al.2015] Tara N Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing.

[Silva et al.2011] Joao Silva, Luísa Coheur, Ana Cristina Mendes, and Andreas Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154.

[Socher et al.2012] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of Empirical Methods on Natural Language Processing, pages 1201–1211.

[Socher et al.2013a] Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL Conference. Citeseer.

[Socher et al.2013b] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Empirical Methods on Natural Language Processing, volume 1631, page 1642. Citeseer.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Association for Computational Linguistics (ACL).

[Tang et al.2015] Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of Empirical Methods on Natural Language Processing.

[Tieleman and Hinton2012] T. Tieleman and G. Hinton. 2012. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning.

[Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 2015 International Conference on Machine Learning.

[Zhao et al.2015] Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the International Joint Conference on Artificial Intelligence.