A C-LSTM Neural Network for Text Classification

Chunting Zhou1, Chonglin Sun2, Zhiyuan Liu3, Francis C.M. Lau1


Department of Computer Science, The University of Hong Kong1
School of Innovation Experiment, Dalian University of Technology2
Department of Computer Science and Technology, Tsinghua University, Beijing3
arXiv:1511.08630v2 [cs.CL] 30 Nov 2015

Abstract

Neural network models have been demonstrated to be capable of achieving remarkable performance in sentence and document modeling. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two mainstream architectures for such modeling tasks, which adopt totally different ways of understanding natural languages. In this work, we combine the strengths of both architectures and propose a novel and unified model called C-LSTM for sentence representation and text classification. C-LSTM utilizes a CNN to extract a sequence of higher-level phrase representations, which are fed into a long short-term memory recurrent neural network (LSTM) to obtain the sentence representation. C-LSTM is thus able to capture both local features of phrases and global, temporal sentence semantics. We evaluate the proposed architecture on sentiment classification and question classification tasks. The experimental results show that C-LSTM outperforms both CNN and LSTM and can achieve excellent performance on these tasks.

1 Introduction

As one of the core steps in NLP, sentence modeling aims at representing sentences as meaningful features for tasks such as sentiment classification. Traditional sentence modeling uses the bag-of-words model, which often suffers from the curse of dimensionality; others use composition-based methods instead, e.g., an algebraic operation over semantic word vectors to produce a semantic sentence vector. However, such methods may not perform well due to the loss of word order information. More recent models for distributed sentence representation fall into two categories according to the form of the input sentence: sequence-based models and tree-structured models. Sequence-based models construct sentence representations from word sequences by taking into account the relationship between successive words (Johnson and Zhang, 2015). Tree-structured models treat each word token as a node in a syntactic parse tree and learn sentence representations from the leaves to the root in a recursive manner (Socher et al., 2013b).

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have emerged as two widely used architectures and are often combined with sequence-based or tree-structured models (Tai et al., 2015; Lei et al., 2015; Tang et al., 2015; Kim, 2014; Kalchbrenner et al., 2014; Mou et al., 2015). Owing to their capability of capturing local correlations of spatial or temporal structures, CNNs have achieved top performance in computer vision, speech recognition and NLP. For sentence modeling, CNNs excel at extracting n-gram features at different positions of a sentence through convolutional filters, and can learn short- and long-range relations through pooling operations. CNNs have been successfully combined with both sequence-based models (Denil et al., 2014; Kalchbrenner et al., 2014) and tree-structured models (Mou et al., 2015) in sentence modeling.

The other popular neural network architecture, the RNN, is able to handle sequences of any length and capture long-term dependencies. To avoid the
problem of exploding or vanishing gradients in the standard RNN, the Long Short-Term Memory RNN (LSTM) (Hochreiter and Schmidhuber, 1997) and other variants (Cho et al., 2014) were designed for better remembering and memory access. Along with sequence-based (Tang et al., 2015) or tree-structured (Tai et al., 2015) models, RNNs have achieved remarkable results in sentence and document modeling.

To conclude, a CNN is able to learn local responses from temporal or spatial data but lacks the ability to learn sequential correlations; an RNN, on the other hand, is specialized for sequential modeling but unable to extract features in a parallel way. It has been shown that higher-level modeling of xt can help to disentangle the underlying factors of variation within the input, which should then make it easier to learn the temporal structure between successive time steps (Pascanu et al., 2014). For example, Sainath et al. (2015) obtained respectable improvements in WER by learning a deep LSTM from multi-scale inputs. We explore training the LSTM model directly from sequences of higher-level representations while preserving the sequence order of these representations. In this paper, we introduce a new architecture, C-LSTM, which combines CNN and LSTM to model sentences. To benefit from the advantages of both CNN and RNN, we design a simple, end-to-end, unified architecture by feeding the output of a one-layer CNN into an LSTM. The CNN is constructed on top of word vectors pre-trained on massive unlabeled text data to learn higher-level representations of n-grams. Then, to learn sequential correlations from these higher-level sequence representations, the feature maps of the CNN are organized as sequential window features that serve as the input to the LSTM. In this way, instead of constructing the LSTM directly from the input sentence, we first transform each sentence into successive window (n-gram) features to help disentangle the factors of variation within sentences. We choose sequence-based input rather than relying on syntactic parse trees before feeding the neural network, so our model does not rely on any external language knowledge or complicated pre-processing.

In our experiments, we evaluate the semantic sentence representations learned by C-LSTM with two tasks: sentiment classification and 6-way question classification. Our evaluations show that the C-LSTM model can achieve excellent results on several benchmarks as compared with a wide range of baseline models. We also show that the combination of CNN and LSTM outperforms individual multi-layer CNN models and RNN models, which indicates that LSTM can learn long-term dependencies from sequences of higher-level representations better than the other models.

2 Related Work

Deep learning based neural network models have achieved great success in many NLP tasks, including learning distributed word, sentence and document representations (Mikolov et al., 2013b; Le and Mikolov, 2014), parsing (Socher et al., 2013a), statistical machine translation (Devlin et al., 2014), and sentiment classification (Kim, 2014). Learning distributed sentence representations through neural network models requires little external domain knowledge and can reach satisfactory results in related tasks like sentiment classification and text categorization. In many recent works on sentence representation learning, neural network models are constructed upon either the input word sequences or the transformed syntactic parse tree. Among them, the convolutional neural network (CNN) and the recurrent neural network (RNN) are two popular ones.

The capability of capturing local correlations, along with extracting higher-level correlations through pooling, empowers CNNs to model sentences naturally from consecutive context windows. In (Collobert et al., 2011), Collobert et al. applied convolutional filters to successive windows of a given sequence to extract global features by max-pooling. As a slight variant, Kim (2014) proposed a CNN architecture with multiple filters (with varying window sizes) and two 'channels' of word vectors. To capture word relations of varying sizes, Kalchbrenner et al. (2014) proposed a dynamic k-max pooling mechanism. In a more recent work, Lei et al. (2015) apply tensor-based operations between words to replace linear operations on concatenated word vectors in the standard convolutional layer and explore
the non-linear interactions between non-consecutive n-grams. Mou et al. (2015) also explore convolutional models on tree-structured sentences.

As a sequence model, the RNN is able to deal with variable-length input sequences and discover long-term dependencies. Various variants of RNN have been proposed to better store and access memories (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). With the ability to explicitly model time-series data, RNNs are being increasingly applied to sentence modeling. For example, Tai et al. (2015) adjusted the standard LSTM to tree-structured topologies and obtained superior results over a sequential LSTM on related tasks.

In this paper, we stack CNN and LSTM in a unified architecture for semantic sentence modeling. The combination of CNN and LSTM can be seen in some computer vision tasks like image captioning (Xu et al., 2015) and speech recognition (Sainath et al., 2015). Most of these models use multi-layer CNNs and either train the CNNs and RNNs separately or feed the output of a fully connected CNN layer into the RNN as input. Our approach is different: we apply the CNN to text data and feed consecutive window features directly to the LSTM, which enables the LSTM to learn long-range dependencies from higher-order sequential features. In (Li et al., 2015), the authors suggest that sequence-based models are sufficient to capture the compositional semantics for many NLP tasks; thus, in this work the CNN is built directly upon word sequences rather than the syntactic parse tree. Our experiments on sentiment classification and 6-way question classification tasks clearly demonstrate the superiority of our model over single CNN or LSTM models and other related sequence-based models.

3 C-LSTM Model

The architecture of the C-LSTM model is shown in Figure 1. It consists of two main components: a convolutional neural network (CNN) and a long short-term memory network (LSTM). The following two subsections describe how we apply the CNN to extract higher-level sequences of word features and the LSTM to capture long-term dependencies over window feature sequences, respectively.

Figure 1: The architecture of C-LSTM for sentence modeling. Blocks of the same color in the feature map layer and the window feature sequence layer correspond to features for the same window. The dashed lines connect the feature of a window with the source feature map. The final output of the entire model is the last hidden unit of the LSTM.

3.1 N-gram Feature Extraction through Convolution

The one-dimensional convolution involves a filter vector sliding over a sequence and detecting features at different positions. Let xi ∈ Rd be the d-dimensional word vector for the i-th word in a sentence, and let x ∈ RL×d denote the input sentence, where L is the length of the sentence. Let k be the length of the filter, and let the vector m ∈ Rk×d be a filter for the convolution operation. For each position j in the sentence, we have a window vector wj with k consecutive word vectors, denoted as:

wj = [xj, xj+1, · · · , xj+k−1]    (1)

Here, the commas represent row vector concatenation. The filter m convolves with the window vectors (k-grams) at each position in a valid way to generate a feature map c ∈ RL−k+1; each element cj of the feature map for window vector wj is produced as follows:

cj = f(wj ◦ m + b),    (2)

where ◦ is element-wise multiplication, b ∈ R is a bias term and f is a nonlinear transformation function that can be sigmoid, hyperbolic tangent, etc. In our case, we choose ReLU (Nair and Hinton, 2010) as the nonlinear function. The C-LSTM model uses multiple filters to generate multiple feature maps.
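To make the convolution step concrete, the following minimal NumPy sketch (not the authors' Theano implementation; the function name and toy dimensions are illustrative) computes one feature map from a sentence matrix according to equations (1) and (2), using ReLU as the nonlinearity.

```python
import numpy as np

def conv1d_feature_map(x, m, b):
    """Valid 1-D convolution of one filter over a sentence.

    x : (L, d) matrix of word vectors (one row per word).
    m : (k, d) filter.
    b : scalar bias.
    Returns a feature map c of length L - k + 1, where
    c[j] = ReLU(sum(w_j * m) + b) and w_j is the window of k
    consecutive word vectors starting at position j (equation (1)).
    """
    L, d = x.shape
    k = m.shape[0]
    c = np.empty(L - k + 1)
    for j in range(L - k + 1):
        w_j = x[j:j + k]             # window vector w_j, shape (k, d)
        c[j] = np.sum(w_j * m) + b   # element-wise product, then sum (equation (2))
    return np.maximum(c, 0.0)        # ReLU nonlinearity

# Toy usage: a 7-word sentence with 5-dimensional embeddings and a tri-gram filter.
rng = np.random.default_rng(0)
x = rng.normal(size=(7, 5))
m = rng.normal(size=(3, 5))
print(conv1d_feature_map(x, m, b=0.1).shape)  # (5,) = L - k + 1
```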
For n filters of the same length, the generated n feature maps can be rearranged as feature representations for each window wj:

W = [c1; c2; · · · ; cn]    (3)

Here, semicolons represent column vector concatenation and ci is the feature map generated by the i-th filter. Each row Wj of W ∈ R(L−k+1)×n is the new feature representation generated from the n filters for the window vector at position j. These successive higher-order window representations are then fed into the LSTM described below.

Max-over-time pooling or dynamic k-max pooling is often applied to the feature maps after the convolution to select the most important or the k most important features. However, the LSTM is specified for sequence input, and pooling would break such sequence organization because the selected features are discontinuous. Since we stack an LSTM on top of the CNN, we do not apply pooling after the convolution operation.

3.2 Long Short-Term Memory Networks

Recurrent neural networks (RNNs) are able to propagate historical information via a chain-like neural network architecture. While processing sequential data, an RNN looks at the current input xt as well as the previous output of the hidden state ht−1 at each time step. However, standard RNNs become unable to learn long-term dependencies as the gap between two time steps becomes large. To address this issue, the LSTM was first introduced in (Hochreiter and Schmidhuber, 1997) and re-emerged as a successful architecture after Sutskever et al. (2014) obtained remarkable performance in statistical machine translation. Although many variants of the LSTM have been proposed, we adopt the standard architecture (Hochreiter and Schmidhuber, 1997) in this work.

The LSTM architecture has a range of repeated modules for each time step, as in a standard RNN. At each time step, the output of the module is controlled by a set of gates in Rd as a function of the old hidden state ht−1 and the input at the current time step xt: the forget gate ft, the input gate it, and the output gate ot. These gates collectively decide how to update the current memory cell ct and the current hidden state ht. We use d to denote the memory dimension in the LSTM; all vectors in this architecture share the same dimension. The LSTM transition functions are defined as follows:

it = σ(Wi · [ht−1, xt] + bi)
ft = σ(Wf · [ht−1, xt] + bf)
qt = tanh(Wq · [ht−1, xt] + bq)
ot = σ(Wo · [ht−1, xt] + bo)
ct = ft ⊙ ct−1 + it ⊙ qt
ht = ot ⊙ tanh(ct)    (4)

Here, σ is the logistic sigmoid function, which has an output in [0, 1]; tanh denotes the hyperbolic tangent function, which has an output in [−1, 1]; and ⊙ denotes element-wise multiplication. To understand the mechanism behind the architecture, we can view ft as the function that controls to what extent the information from the old memory cell is thrown away, it as the function that controls how much new information is stored in the current memory cell, and ot as the function that controls what to output based on the memory cell ct. The LSTM is explicitly designed for time-series data and for learning long-term dependencies, and therefore we place the LSTM on top of the convolution layer to learn such dependencies in the sequence of higher-level features.
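For reference, a minimal NumPy sketch of one LSTM transition implementing equation (4) is given below; the weight layout follows the concatenation [ht−1, xt] used above, and all names and dimensions are illustrative rather than taken from the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM transition (equation (4)).

    x_t    : (n,) input at time t (here: one window feature vector).
    h_prev : (d,) previous hidden state.
    c_prev : (d,) previous memory cell.
    W      : dict of (d, d + n) weight matrices for gates i, f, q, o.
    b      : dict of (d,) biases for the same gates.
    Returns the new hidden state h_t and memory cell c_t.
    """
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    i = sigmoid(W["i"] @ z + b["i"])     # input gate
    f = sigmoid(W["f"] @ z + b["f"])     # forget gate
    q = np.tanh(W["q"] @ z + b["q"])     # candidate cell value
    o = sigmoid(W["o"] @ z + b["o"])     # output gate
    c_t = f * c_prev + i * q             # update memory cell
    h_t = o * np.tanh(c_t)               # new hidden state
    return h_t, c_t

# Toy usage: run the cell over a sequence of window feature vectors (rows of W in eq. (3)).
rng = np.random.default_rng(0)
n, d, T = 4, 3, 6                        # feature size, memory dimension, sequence length
W_params = {g: rng.normal(scale=0.1, size=(d, d + n)) for g in "ifqo"}
b_params = {g: np.zeros(d) for g in "ifqo"}
h, c = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(T, n)):
    h, c = lstm_step(x_t, h, c, W_params, b_params)
print(h)                                 # final hidden state: the sentence representation
```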
4 Learning C-LSTM for Text Classification

For text classification, we regard the output of the hidden state at the last time step of the LSTM as the document representation and add a softmax layer on top. We train the entire model by minimizing the cross-entropy error. Given a training sample x(i), its true label y(i) ∈ {1, 2, · · · , k} where k is the number of possible labels, and the estimated probabilities ỹj(i) ∈ [0, 1] for each label j ∈ {1, 2, · · · , k}, the error is defined as:

L(x(i), y(i)) = − Σ_{j=1}^{k} 1{y(i) = j} log(ỹj(i)),    (5)

where 1{condition} is an indicator function such that 1{condition is true} = 1 and 1{condition is false} = 0. We employ stochastic gradient descent (SGD) to learn the model parameters and adopt the RMSprop optimizer (Tieleman and Hinton, 2012).
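The classification head described in this section can be sketched as follows (illustrative NumPy, assuming a generic softmax layer on top of the last hidden state): an affine transform of the last LSTM hidden state followed by a softmax, with the cross-entropy error of equation (5) evaluated for one sample.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(h_last, W_s, b_s, y_true):
    """Cross-entropy error (equation (5)) for one training sample.

    h_last : (d,) hidden state at the last time step of the LSTM.
    W_s    : (k, d) softmax weights, b_s : (k,) biases, k = number of labels.
    y_true : integer label in {0, ..., k-1}.
    """
    probs = softmax(W_s @ h_last + b_s)  # estimated probabilities for each label
    return -np.log(probs[y_true])        # only the true label's term survives the indicator

# Toy usage with d = 3 and k = 5 labels (e.g. fine-grained sentiment).
rng = np.random.default_rng(0)
h_last = rng.normal(size=3)
W_s, b_s = rng.normal(size=(5, 3)), np.zeros(5)
print(cross_entropy(h_last, W_s, b_s, y_true=2))
```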
4.1 Padding and Word Vector Initialization

First, we use maxlen to denote the maximum sentence length in the training set. As the convolution layer in our model requires fixed-length input, we pad each sentence shorter than maxlen with special symbols at the end that indicate unknown words. For sentences in the test dataset, we pad those shorter than maxlen in the same way, and for sentences longer than maxlen we simply cut extra words at the end to reach maxlen.

We initialize word vectors with the publicly available word2vec vectors (http://code.google.com/p/word2vec/) that are pre-trained on about 100B words from the Google News dataset. The dimensionality of the word vectors is 300. We initialize the word vectors for unknown words from the uniform distribution [-0.25, 0.25]. We then fine-tune the word vectors along with the other model parameters during training.
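A minimal sketch of the padding and initialization scheme described above (illustrative Python/NumPy, not the authors' preprocessing code; the helper names are hypothetical): sentences are padded or truncated to maxlen, pre-trained vectors are looked up where available, and unknown words are drawn from the uniform distribution [-0.25, 0.25].

```python
import numpy as np

PAD, DIM = "<pad>", 300

def pad_or_truncate(tokens, maxlen):
    """Pad short sentences with a special symbol; cut long ones at the end."""
    return tokens[:maxlen] + [PAD] * max(0, maxlen - len(tokens))

def build_embeddings(vocab, pretrained, dim=DIM, seed=0):
    """Look up pre-trained vectors; sample unknown words from U[-0.25, 0.25]."""
    rng = np.random.default_rng(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = pretrained[word]
        else:
            table[word] = rng.uniform(-0.25, 0.25, size=dim)
    table[PAD] = rng.uniform(-0.25, 0.25, size=dim)  # padding symbol treated as unknown
    return table

# Toy usage: a 4-word sentence padded to maxlen = 6.
pretrained = {"the": np.ones(DIM)}                   # stand-in for word2vec vectors
tokens = pad_or_truncate("the movie is awesome".split(), maxlen=6)
emb = build_embeddings(set(tokens), pretrained)
x = np.stack([emb[w] for w in tokens])               # (maxlen, 300) input to the CNN
print(x.shape)
```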
4.2 Regularization

For regularization, we employ two commonly used techniques: dropout (Hinton et al., 2012) and L2 weight regularization. We apply dropout to prevent co-adaptation. In our model, we apply dropout either to the word vectors before feeding the sequence of words into the convolutional layer or to the output of the LSTM before the softmax layer. The L2 regularization is applied to the weights of the softmax layer.
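The two regularizers can be sketched as follows (illustrative NumPy; an inverted-dropout formulation is assumed, and the function names are hypothetical): a dropout mask applied to a layer's output at training time, and an L2 penalty on the softmax weights added to the loss.

```python
import numpy as np

def dropout(h, rate, rng, train=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not train or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

def l2_penalty(W_softmax, factor):
    """L2 weight regularization term added to the cross-entropy loss."""
    return factor * np.sum(W_softmax ** 2)

# Toy usage: drop out an LSTM output with p = 0.5 and penalize the softmax weights.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
W_softmax = rng.normal(size=(5, 8))
print(dropout(h, rate=0.5, rng=rng), l2_penalty(W_softmax, factor=0.001))
```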
5 Experiments

We evaluate the C-LSTM model on two tasks: (1) sentiment classification and (2) question type classification. In this section, we introduce the datasets and the experimental settings.

5.1 Datasets

Sentiment Classification: Our task here is to predict the sentiment polarity of movie reviews. We use the Stanford Sentiment Treebank (SST) benchmark (Socher et al., 2013b). This dataset consists of 11855 movie reviews and is split into train (8544), dev (1101), and test (2210). Sentences in this corpus are parsed, and all phrases along with the sentences are fully annotated with 5 labels: very positive, positive, neutral, negative, very negative. We consider two classification tasks on this dataset: fine-grained classification with 5 labels and binary classification obtained by removing the neutral labels. For the binary classification, the dataset has a split of train (6920) / dev (872) / test (1821). Since the data is provided at the level of sub-sentences (phrases), we train the model on both phrases and sentences but only test on the sentences, as in several previous works (Socher et al., 2013b; Kalchbrenner et al., 2014).

Question type classification: Question classification is an important step in a question answering system; it classifies a question into a specific type, e.g. "what is the highest waterfall in the United States?" is a question that belongs to "location". For this task, we use the TREC benchmark (Li and Roth, 2002). TREC divides all questions into 6 categories: location, human, entity, abbreviation, description and numeric. The training dataset contains 5452 labelled questions and the testing dataset contains 500 questions.

5.2 Experimental Settings

We implement our model with Theano (Bastien et al., 2012), a Python library that supports efficient symbolic differentiation and transparent use of a GPU. To benefit from the efficiency of parallel tensor computation, we train the model on a GPU. For text preprocessing, we only convert all characters in the dataset to lower case. For SST, we tune the hyperparameters (number of filters and filter length in the CNN; memory dimension in the LSTM; dropout rate and which layer to apply it to; etc.) on the validation data in the standard split. For TREC, we hold out 1000 samples from the training dataset for hyperparameter search and train the model on the remaining data.

In our final settings, we use only one convolutional layer and one LSTM layer for both tasks. For the filter size, we investigated filter lengths of 2, 3 and 4 in two cases: a) a single convolutional layer with one filter length, and b) multiple convolutional layers with different filter lengths in parallel. Here we denote the number of filters of length i by ni for ease of clarification.
Model Fine-grained (%) Binary (%) Reported in
SVM 40.7 79.4 (Socher et al., 2013b)
NBoW 42.4 80.5 (Kalchbrenner et al., 2014)
Paragraph Vector 48.7 87.8 (Le and Mikolov, 2014)
RAE 43.2 82.4 (Socher, Pennington, et al., 2011)
MV-RNN 44.4 82.9 (Socher et al., 2012)
RNTN 45.7 85.4 (Socher et al., 2013b)
DRNN 49.8 86.6 (Irsoy and Cardie, 2014)
CNN-non-static 48.0 87.2 (Kim, 2014)
CNN-multichannel 47.4 88.1 (Kim, 2014)
DCNN 48.5 86.8 (Kalchbrenner et al., 2014)
Molding-CNN 51.2 88.6 (Lei et al., 2015)
Dependency Tree-LSTM 48.4 85.7 (Tai et al., 2015)
Constituency Tree-LSTM 51.0 88.0 (Tai et al., 2015)
LSTM 46.6 86.6 our implementation
Bi-LSTM 47.8 87.9 our implementation
C-LSTM 49.2 87.8 our implementation
Table 1: Comparison with baseline models on the Stanford Sentiment Treebank. Fine-grained is a 5-class classification task; Binary is a 2-class classification task. The first block contains other baseline methods, the second block contains the recursive models, the third block contains methods related to convolutional neural networks, and the fourth block contains methods using LSTM (the first two methods in this block also use syntactic parse trees). The last block is our model.

For the first case, each n-gram window is transformed into ni convolved features and the sequence of window representations is fed into the LSTM. For the latter case, since the number of windows generated by each convolutional layer varies with the filter length (see L − k + 1 below equation (3)), we cut the window sequence at the end based on the maximum filter length, which gives the smallest number of windows; each window is then represented as the concatenation of the outputs from the different convolutional layers. We also explored different combinations of filter lengths, and we present an experimental analysis of this exploration of filter sizes later. Based on these experiments, we choose a single convolutional layer with filter length 3.

For SST, the number of filters of length 3 is set to 150 and the memory dimension of the LSTM is also set to 150. The word vector layer and the LSTM layer are dropped out with a probability of 0.5. For TREC, the number of filters is set to 300 and the memory dimension is set to 300; the word vector layer and the LSTM layer are again dropped out with a probability of 0.5. We also add L2 regularization with a factor of 0.001 to the weights of the softmax layer for both tasks.
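For reference, the final settings described above can be collected into a single configuration sketch (field names are hypothetical, not the authors' code):

```python
# Final hyperparameter settings reported above, written as an illustrative config.
FINAL_SETTINGS = {
    "SST": {
        "filter_length": 3,      # single convolutional layer, tri-gram filters
        "num_filters": 150,
        "lstm_memory_dim": 150,
        "dropout": 0.5,          # word vector layer and LSTM output layer
        "l2_softmax": 0.001,     # L2 factor on the softmax weights
    },
    "TREC": {
        "filter_length": 3,
        "num_filters": 300,
        "lstm_memory_dim": 300,
        "dropout": 0.5,
        "l2_softmax": 0.001,
    },
}
```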
6 Results and Model Analysis

In this section, we show our evaluation results on the sentiment classification and question type classification tasks. We also give some analysis of the filter size configuration.

6.1 Sentiment Classification

The results are shown in Table 1. We compare our model with a large set of well-performing models on the Stanford Sentiment Treebank. Generally, the baseline models consist of recursive models, convolutional neural network models, LSTM-related models and others. The recursive models employ a syntactic parse tree as the sentence structure, and the sentence representation is computed recursively in a bottom-up manner along the parse tree. Under this category, we choose the recursive autoencoder (RAE), matrix-vector (MV-RNN), tensor-based composition (RNTN) and multi-layer stacked (DRNN) recursive neural networks as baselines.
Model Acc Reported in
SVM 95.0 Silva et al. (2011)
Paragraph Vector 91.8 Zhao et al. (2015)
Ada-CNN 92.4 Zhao et al. (2015)
CNN-non-static 93.6 Kim (2014)
CNN-multichannel 92.2 Kim (2014)
DCNN 93.0 Kalchbrenner et al. (2014)
LSTM 93.2 our implementation
Bi-LSTM 93.0 our implementation
C-LSTM 94.6 our implementation
Table 2: The 6-way question type classification accuracy on TREC.

Among CNNs, we compare with Kim's (2014) CNN model with fine-tuned word vectors (CNN-non-static) and multiple channels (CNN-multichannel), the DCNN with dynamic k-max pooling, and Lei et al.'s (2015) Molding-CNN with low-rank tensor-based non-linear and non-consecutive convolutions. Among LSTM-related models, we first compare with two tree-structured LSTM models (Dependency Tree-LSTM and Constituency Tree-LSTM) that adapt the LSTM to tree-structured network topologies. We then implement a one-layer LSTM and a Bi-LSTM ourselves. Since we could not tune the Bi-LSTM to match the result reported in (Tai et al., 2015), even when following their untied weight configuration, we report our own results. For the other baseline methods, we compare against an SVM with unigram and bigram features, NBoW with averaged word vector features, and the paragraph vector, which infers a new paragraph vector for unseen documents.

To the best of our knowledge, we achieve the fourth best published result for the 5-class classification task on this dataset. For the binary classification task, we achieve results comparable to the state of the art. From Table 1, we have the following observations: (1) Although we did not beat the state-of-the-art models, our end-to-end model is still promising and comparable with models that rely heavily on linguistic annotations and knowledge, especially syntactic parse trees; this indicates that C-LSTM is more feasible for various scenarios. (2) Comparing our results against the single CNN and LSTM models shows that the LSTM does learn long-term dependencies across sequences of higher-level representations better. In the future, we could explore how to learn more compact higher-level representations by replacing the standard convolution with other non-linear feature mapping functions or by appealing to tree-structured topologies before the convolutional layer.

6.2 Question Type Classification

The prediction accuracy on TREC question classification is reported in Table 2. We compare our model with a variety of models. The SVM classifier uses unigrams, bigrams, wh-words, head words, POS tags, parser output, hypernyms and WordNet synsets as engineered features, plus 60 hand-coded rules. Ada-CNN is a self-adaptive hierarchical sentence model with gating networks. The other baseline models were introduced in the previous task. From Table 2, we have the following observations: (1) Our result consistently outperforms all published neural baseline models, which suggests that C-LSTM captures the intentions of TREC questions well. (2) Our result is close to that of the state-of-the-art SVM, which depends on highly engineered features. Such engineered features not only demand human labor but also lead to error propagation from the existing NLP tools, and thus may not generalize well to other datasets and tasks. With the ability to automatically learn semantic sentence representations, C-LSTM does not require any human-designed features and has better scalability.

6.3 Model Analysis

Here we investigate the impact of different filter configurations in the convolutional layer on model performance. In the convolutional layer of our model, filters are used to capture local n-gram features. Intuitively,
multiple convolutional layers in parallel with different filter sizes should perform better than a single convolutional layer with one filter length, in that different filter sizes could exploit features of different n-grams. However, we found in our experiments that a single convolutional layer with filter length 3 always outperforms the other cases.

Figure 2: Prediction accuracies on TREC questions with different filter size strategies. For the horizontal axis, S means a single convolutional layer with one filter length, and M means multiple convolutional layers in parallel with different filter lengths. (Plot omitted; x-axis configurations: S:2, S:3, S:4, M:2,3, M:2,4, M:3,4, M:2,3,4; y-axis: accuracy, roughly 0.92–0.95.)

We show in Figure 2 the prediction accuracies on the 6-way question classification task using different filter configurations; we observed a similar phenomenon in the sentiment classification task. For each filter configuration, we report in Figure 2 the best result under an extensive grid search over hyperparameters. It is shown that a single convolutional layer with filter length 3 performs best among all filter configurations. For the case of multiple convolutional layers in parallel, filter configurations that include filter length 3 perform better than those without tri-gram filters, which further confirms that tri-gram features play a significant role in capturing local features in our tasks. We conjecture that the LSTM learns better semantic sentence representations from sequences of tri-gram features.

7 Conclusion and Future Work

We have described a novel, unified model called C-LSTM that combines a convolutional neural network with a long short-term memory network (LSTM). C-LSTM is able to learn phrase-level features through a convolutional layer; sequences of such higher-level representations are then fed into the LSTM to learn long-term dependencies. We evaluated the learned semantic sentence representations on sentiment classification and question type classification tasks with very satisfactory results.

In the future, we could explore ways to replace the standard convolution with tensor-based operations or tree-structured convolutions. We believe the LSTM will benefit from more structured higher-level representations.

References

[Bastien et al.2012] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Denil et al.2014] Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. 2014. Modelling, visualising and summarising documents with a single convolutional neural network. arXiv preprint arXiv:1406.3830.

[Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1370–1380.

[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. The Computing Research Repository (CoRR).

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pages 2096–2104.

[Johnson and Zhang2015] Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 103–112.

[Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. Association for Computational Linguistics (ACL).

[Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods in Natural Language Processing.

[Le and Mikolov2014] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

[Lei et al.2015] Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: non-linear, non-consecutive convolutions. In Proceedings of Empirical Methods in Natural Language Processing.

[Li and Roth2002] Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7. Association for Computational Linguistics.

[Li et al.2015] Jiwei Li, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of Empirical Methods in Natural Language Processing.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Mou et al.2015] Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Discriminative neural sentence modeling by tree-based convolution. Unpublished manuscript: http://arxiv.org/abs/1504.01106v5.

[Nair and Hinton2010] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

[Pascanu et al.2014] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to construct deep recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).

[Sainath et al.2015] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing.

[Silva et al.2011] Joao Silva, Luísa Coheur, Ana Cristina Mendes, and Andreas Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154.

[Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of Empirical Methods in Natural Language Processing, pages 1201–1211.

[Socher et al.2013a] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

[Socher et al.2013b] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Empirical Methods in Natural Language Processing.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Association for Computational Linguistics (ACL).

[Tang et al.2015] Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of Empirical Methods in Natural Language Processing.

[Tieleman and Hinton2012] T. Tieleman and G. Hinton. 2012. Lecture 6.5 - RMSprop, COURSERA: Neural Networks for Machine Learning.

[Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 2015 International Conference on Machine Learning.

[Zhao et al.2015] Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the International Joint Conference on Artificial Intelligence.
