
Neurocomputing 120 (2013) 536–546

Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Active deep learning method for semi-supervised sentiment classification

Shusen Zhou a,*, Qingcai Chen b, Xiaolong Wang b

a School of Information and Electrical Engineering, Ludong University, Yantai, PR China
b Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, PR China

* Corresponding author. Tel.: +86 13583573988. E-mail addresses: [email protected] (S. Zhou), [email protected] (Q. Chen), [email protected] (X. Wang).

Article info

Article history:
Received 9 May 2012
Received in revised form 9 April 2013
Accepted 15 April 2013
Available online 23 May 2013
Communicated by M. Wang

Keywords:
Neural networks
Deep learning
Active learning
Sentiment classification

Abstract

In the natural language processing community, sentiment classification based on insufficient labeled data is a well-known challenging problem. In this paper, a novel semi-supervised learning algorithm called active deep network (ADN) is proposed to address this problem. First, we propose the semi-supervised learning framework of ADN. ADN is constructed by restricted Boltzmann machines (RBM) with unsupervised learning based on labeled reviews and abundant unlabeled reviews. Then the constructed structure is fine-tuned by gradient-descent based supervised learning with an exponential loss function. Second, within the semi-supervised learning framework, we apply active learning to identify the reviews that should be labeled as training data, and then use the selected labeled reviews and all unlabeled reviews to train the ADN architecture. Moreover, we combine information density with ADN and propose the information ADN (IADN) method, which takes the information density of all unlabeled reviews into account when choosing the reviews to be labeled manually. Experiments on five sentiment classification datasets show that ADN and IADN outperform classical semi-supervised learning algorithms and deep learning techniques applied to sentiment classification.

© 2013 Elsevier B.V. All rights reserved.

0925-2312/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2013.04.017

1. Introduction

In recent years, sentiment analysis has received considerable attention in the natural language processing (NLP) community [1–3]. Sentiment classification is a special type of text categorization, where the criterion of classification is the attitude expressed in the text, such as positive or negative, thumbs up or thumbs down, favorable or unfavorable, etc. [4], rather than the subject or topic. Labeling reviews with their sentiment would provide succinct summaries to readers, makes it possible to focus text mining on areas in need of improvement or on areas of success [5], and is helpful in business intelligence applications, recommender systems, and message filtering tasks [1].

While topics are often identifiable by keywords alone, sentiment classification appears to be a more challenging task [1]. First, sentiment is often conveyed with subtle linguistic mechanisms such as the use of sarcasm and highly domain-specific contextual cues [6]. For example, although the sentence "the thief tries to protect his excellent reputation" contains the word "excellent", it tells us nothing about the author's opinion and in fact could well be embedded in a negative review. Second, sentiment classification systems are typically domain-specific, which makes annotating a large amount of data for each domain an expensive process and a bottleneck in building high-quality systems [3]. This motivates the task of learning robust sentiment models from minimal supervision [6].

Recently, semi-supervised learning, which uses a large amount of unlabeled data together with labeled data to build better learners [7,8], has attracted more and more attention in sentiment classification [3,6]. Wang et al. propose a semi-supervised kernel density estimation algorithm, in which both labeled and unlabeled data are leveraged to estimate class conditional probability densities based on an extended form of kernel density estimation [9]. Wang et al. propose an optimized multigraph-based semi-supervised learning method which aims to simultaneously tackle the insufficiency of training data and the curse of dimensionality in a unified scheme [10]. Zha et al. propose a novel graph-based learning framework in the setting of semi-supervised learning with multiple labels [11]. As argued by several research works [12,13], deep architecture, which is composed of multiple levels of non-linear operations [14], is expected to perform well in semi-supervised learning because of its capability of modeling hard artificial intelligence tasks. Deep belief network (DBN) is a representative deep learning algorithm that has achieved notable success for semi-supervised learning in the NLP community [14]. Ranzato and Szummer [15] introduce an algorithm to learn text
document representations, which is based on semi-supervised auto-encoders that are combined to form a deep network.

Active learning is another way to minimize the number of required labeled data while obtaining a competitive result. Rather than choosing the training set randomly, active learning chooses the training data actively, which reduces the need for labeled data [16]. It has been widely explored in the multimedia research community for its capability of reducing human annotation effort [17]. Zha et al. propose a novel active learning approach based on the optimum experimental design criteria in statistics for interactive video indexing [18]. Active learning is well-suited to many NLP problems, where unlabeled data may be abundant but annotation is slow and expensive [19]. Druck et al. propose an active learning approach in which the machine solicits labels on features rather than instances [20]. Zhu et al. combine active and semi-supervised learning under a Gaussian random field model; the active learning scheme requires a much smaller number of queries to achieve high accuracy compared with random query selection [21]. Recently, active learning has also been applied in sentiment classification [3].

Inspired by the study of semi-supervised learning, active learning and deep learning, this paper proposes a semi-supervised sentiment classification method called active deep network (ADN). It is based on a representative deep learning method, DBN [14], and an active learning method [16]. First, we introduce the semi-supervised learning procedure of the ADN method, which constructs the deep architecture with all unlabeled and labeled reviews, and fine-tunes the deep architecture with few labeled reviews. To maximize the separability of the classifier, an exponential loss function is suggested. Second, we introduce the active learning procedure of the ADN method. It first identifies a small number of unlabeled reviews for manual labeling by an active learner, and then trains the deep architecture with the labeled reviews and all other unlabeled reviews. Moreover, we propose the information ADN (IADN) method, which combines information density with ADN and puts the information density of all unlabeled reviews into consideration while choosing the unlabeled reviews for further labeling.

The main contributions of this paper include: First, this paper introduces a new deep architecture that integrates the abstraction ability of deep belief networks and the classification ability of the backpropagation strategy. It improves the generalization capability by using the abundant number of unlabeled reviews, and directly optimizes the classification results on the training dataset via the backpropagation strategy, which makes it possible to achieve attractive classification performance with few labeled reviews. Second, this paper proposes two effective active learning methods that integrate the review selection ability of active learning and the classification ability of the deep architecture. Both the labeled review selector and the classifier are based on the same architecture, which provides a unified framework for the semi-supervised classification task. Third, this paper applies semi-supervised learning and active learning to sentiment classification successfully and obtains competitive performance. Our experimental results on five sentiment classification datasets show that both ADN and IADN outperform previous sentiment classification methods and deep learning methods.

This paper is an expanded version of Zhou et al. [22]. Many new contents are incorporated here: First, the related work on sentiment classification has been extended, and a more detailed introduction to sentiment classification methods is given. Second, an active learning method called IADN is proposed, which combines information density with ADN and achieves competitive performance for sentiment classification. Third, more experiments have been conducted to evaluate the performance of the deep architecture, the information density incorporation, and various loss functions. Moreover, we evaluate the proposed active learning methods with different numbers of labeled and unlabeled reviews.

The rest of the paper is organized as follows. Section 2 gives an overview of sentiment classification. The proposed semi-supervised learning method ADN is described in Section 3. Section 4 combines ADN and information density into the IADN method. Section 5 evaluates ADN and IADN by comparing their classification performance with existing sentiment classification methods and deep learning methods on sentiment datasets. The paper is closed with a conclusion.

2. Sentiment classification

Sentiment classification can be performed on words, sentences or documents, and is generally categorized into lexicon-based [23] and corpus-based classification methods [24]. A detailed survey of the techniques and approaches of sentiment classification can be found in the book [25]. In this paper, we focus on corpus-based classification methods.

Corpus-based methods use a labeled corpus to train a sentiment classifier [24]. Pang et al. [1] are the first to apply a machine learning approach to corpus-based sentiment classification. They found that standard machine learning techniques outperform human-produced baselines. They also carried out important experiments on selecting the best features and concluded that unigrams performed better than bigrams or unigrams and bigrams together. Dave et al. [26] draw attention to information retrieval techniques for feature extraction and scoring in the sentiment classification task. Pang and Lee [27] apply text-categorization techniques to the subjective portions of sentiment documents; these portions are extracted by efficient techniques for finding minimum cuts in graphs. Gamon [5] demonstrates that high accuracy can be achieved by using large feature vectors in combination with feature reduction in the very noisy domain of customer feedback data. Mullen and Collier [28] use support vector machines to bring together diverse sources of potentially pertinent information for sentiment classification, including several favorability measures for phrases and adjectives and, where available, knowledge of the topic of the text. Ng et al. [29] demonstrate that sentiment classification can be performed with high accuracy using only unigrams as features. Mcdonald et al. [30] investigate a structured model for jointly classifying the sentiment of text at various levels of granularity, which is based on standard sequence classification techniques using constrained Viterbi to ensure consistent solutions. Xia et al. [31] introduce the sentiment vector space model to represent song lyric documents and assign sentiment labels such as light-hearted and heavy-hearted. Li et al. [32] propose a machine learning approach to incorporate polarity shifting information into a document-level sentiment classification system. Liu et al. [33] present an adaptive sentiment analysis model that aims to capture the hidden sentiment factors in reviews and can be incrementally updated as more data become available. Wei et al. [34] propose a novel approach to label attributes of a product and their associated sentiments in product reviews by a hierarchical learning process with a defined sentiment ontology tree.

Supervised sentiment classification systems are domain-specific, and annotating a large-scale corpus for each domain is very expensive [3]. There exist several solutions for this issue. The first solution is cross-domain sentiment classification. Aue and Gamon [35] survey four different approaches to customize a sentiment classification system for a new target domain in the absence of large amounts of labeled data. Blitzer et al. [2] investigate domain adaptation for sentiment classifiers, reduce the relative error due to adaptation between domains by an average of 46% over a supervised baseline, and identify a measure of domain similarity that correlates well with the potential for adaptation of a classifier from one domain to another. Tan et al. [36] combine old-domain labeled examples with new-domain unlabeled ones, and retrain the base classifier over all
these examples. Li and Zong [37] study multi-domain sentiment classification, which aims to improve performance by fusing training data from multiple domains. Pan et al. [38] propose a cross-domain sentiment classification method that aligns domain-specific words extracted from different domains into unified clusters, with the help of domain-independent words as a bridge. Bollegala et al. [39] automatically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. He et al. [40] modify the joint sentiment-topic model by incorporating word polarity priors through modifying the topic-word Dirichlet priors, study the polarity-bearing topics extracted by the joint sentiment-topic model, and show that by augmenting the original feature space with polarity-bearing topics they achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset.

The second solution is semi-supervised sentiment classification. Goldberg and Zhu [41] present a graph-based semi-supervised learning algorithm to address the sentiment analysis task of rating inference, inferring numerical ratings based on the perceived sentiment. Sindhwani and Melville [42] propose a semi-supervised sentiment classification algorithm that utilizes lexical prior knowledge in conjunction with unlabeled data. Dasgupta and Ng [3] first mine the unambiguous reviews using spectral techniques, and then exploit them to classify the ambiguous reviews via a novel combination of active learning, transductive learning, and ensemble learning. Li et al. [4] adopt two views, personal and impersonal views, and employ them in both supervised and semi-supervised sentiment classification.

The third solution is unsupervised sentiment classification. Zagibalov and Carroll [43] describe an automatic seed word selection method for unsupervised sentiment classification of product reviews in Chinese.

There are also several other methods to address this issue. Read [44] demonstrates that training data automatically labeled with emoticons has the potential of being independent of domain, topic and time. Wan [24] studies cross-lingual sentiment classification, which leverages an available English corpus for Chinese sentiment classification by using the English corpus as training data. Machine translation services are used for eliminating the language gap between the training set and the test set, and English features and Chinese features are considered as two independent views of the classification problem. Lu et al. [45] present a novel approach for joint bilingual sentiment classification at the sentence level that augments the available labeled data in each language with unlabeled parallel data.

However, unsupervised learning of sentiment is difficult, partially because of the prevalence of sentimentally ambiguous reviews [3]. Using a multi-domain sentiment corpus for sentiment classification is also hard to apply, because each domain has a very limited amount of training data, since annotating a large corpus is difficult and time-consuming [37]. Cross-domain, semi-supervised learning and unsupervised learning methods are all used against the background that training data are not enough. When there are not enough training data for each domain, we can use cross-domain methods. When there are not enough labeled data, we can use semi-supervised learning methods. When there are no labeled data, we can use unsupervised learning methods. In this paper, we just focus on semi-supervised sentiment classification methods.

3. Active deep networks

In this part, we propose a semi-supervised learning algorithm, active deep network (ADN), to address the sentiment classification problem with active learning. Section 3.1 formulates the ADN problem. Section 3.2 proposes the semi-supervised learning method of ADN. Section 3.3 proposes the active learning method of ADN. Section 3.4 gives the ADN procedure.

3.1. Problem formulation

The dataset is composed of a substantial amount of product reviews. We preprocess the reviews to be classified; the experimental setting is the same as [3]. Each review is represented as a vector of unigrams, using a binary weight equal to 1 for terms present in the vector. Moreover, punctuation, numbers, and words of length one are removed from the vector. Finally, we sort the vocabulary by document frequency and remove the top 1.5%, because many of these high document frequency words are stopwords or domain-specific general-purpose words (e.g., "book" in the book domain), and these noise words would not be helpful for sentiment classification. These words typically comprise 1–2% of a vocabulary; the decision of exactly how many terms to remove is subjective: a large corpus typically requires more removals than a small corpus. To be consistent, we simply remove the top 1.5% high frequency words.
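This preprocessing pipeline can be sketched in a few lines. The Python snippet below is only an illustration of the steps just described (binary unigram weights; removal of punctuation, numbers, and single-character words; pruning of the top 1.5% of terms by document frequency); the function and variable names are ours and are not taken from the authors' implementation.

```python
import re
from collections import Counter

def build_binary_unigram_matrix(reviews, top_df_fraction=0.015):
    """Tokenize reviews, drop punctuation/numbers/1-character words,
    remove the top `top_df_fraction` of terms by document frequency,
    and return binary unigram vectors (one column per review)."""
    token_pattern = re.compile(r"[a-zA-Z]+")   # keeps letters only, so punctuation and numbers vanish
    docs = []
    for text in reviews:
        tokens = {t.lower() for t in token_pattern.findall(text) if len(t) > 1}
        docs.append(tokens)

    # document frequency of every term
    df = Counter(t for tokens in docs for t in tokens)
    vocab_by_df = [t for t, _ in df.most_common()]        # sorted by df, descending
    n_remove = int(len(vocab_by_df) * top_df_fraction)    # top 1.5% high-df terms
    vocab = sorted(vocab_by_df[n_remove:])                 # remaining feature words
    index = {t: i for i, t in enumerate(vocab)}

    # binary weight: 1 if the term occurs in the review, 0 otherwise (columns of X, Eq. (1))
    X = [[0] * len(reviews) for _ in vocab]
    for j, tokens in enumerate(docs):
        for t in tokens:
            if t in index:
                X[index[t]][j] = 1
    return vocab, X
```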
After preprocessing, each review is represented by a vector. Then the dataset is represented as a matrix

X = [x^1, x^2, \ldots, x^{R+T}] = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^{R+T} \\ x_2^1 & x_2^2 & \cdots & x_2^{R+T} \\ \vdots & \vdots & & \vdots \\ x_D^1 & x_D^2 & \cdots & x_D^{R+T} \end{bmatrix}    (1)

where R is the number of training reviews, T is the number of test reviews, and D is the number of feature words in the dataset. Each column of X corresponds to a sample x, i.e., one review. A sample that has all features is viewed as a vector in R^D, where the jth coordinate corresponds to the jth feature.

The L labeled reviews are chosen randomly from the R training reviews, or chosen actively by active learning, which can be written as

X_L = X^R(S), \quad S = [s_1, \ldots, s_L], \quad 1 \le s_i \le R    (2)

where S is the index set of the selected training reviews to be labeled manually.

Let Y be the set of labels corresponding to the L labeled training reviews, denoted as

Y_L = [y^1, y^2, \ldots, y^L] = \begin{bmatrix} y_1^1 & y_1^2 & \cdots & y_1^L \\ y_2^1 & y_2^2 & \cdots & y_2^L \\ \vdots & \vdots & & \vdots \\ y_C^1 & y_C^2 & \cdots & y_C^L \end{bmatrix}    (3)

where C is the number of classes. Each column of Y is a vector in R^C, where the jth coordinate corresponds to the jth class:

y_j = \begin{cases} 1 & \text{if } x \in j\text{th class} \\ -1 & \text{if } x \notin j\text{th class} \end{cases}    (4)

For example, if a review x^i is positive, y^i = [1, -1]; otherwise, y^i = [-1, 1].

We intend to seek the mapping function X_L → Y_L using the L labeled reviews and all unlabeled reviews, in order to determine y when a new review x comes.

3.2. Semi-supervised learning

To address the problem formulated in Section 3.1, we propose a deep architecture for the ADN method, as shown in Fig. 1. The deep architecture is a fully interconnected directed belief net with one input layer h^0, N hidden layers h^1, h^2, ..., h^N, and one label layer at the top. The input layer h^0 has D units, equal to the number of features of the sample review x. The label layer has C units, equal to the number of classes of the label vector y. The numbers of units in the hidden layers are currently pre-defined according to experience or intuition. The seeking of the mapping function is here transformed into the problem of finding the parameter space W = {w^1, w^2, ..., w^N} for the deep architecture.

Fig. 1. Architecture of active deep networks.

The semi-supervised learning method based on the ADN architecture can be divided into two stages. First, the ADN architecture is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks; all the unlabeled reviews together with the L labeled reviews are utilized to find the parameter space W with N layers. Second, the ADN architecture is trained according to the exponential loss function using the gradient descent method; the parameter space W is retrained with an exponential loss function using the L labeled data. As it is difficult to optimize a deep architecture using supervised learning directly, the unsupervised learning stage can abstract the reviews effectively and prevent overfitting of the supervised training.

For unsupervised learning, we define the energy of the joint configuration (h^{k-1}, h^k) as [14]

E(h^{k-1}, h^k; \theta) = -\sum_{s=1}^{D_{k-1}} \sum_{t=1}^{D_k} w_{st}^k h_s^{k-1} h_t^k - \sum_{s=1}^{D_{k-1}} b_s^{k-1} h_s^{k-1} - \sum_{t=1}^{D_k} c_t^k h_t^k    (5)

where \theta = (w, b, c) are the model parameters: w_{st}^k is the symmetric interaction term between unit s in layer h^{k-1} and unit t in layer h^k, k = 1, ..., N-1; b_s^{k-1} is the sth bias of layer h^{k-1} and c_t^k is the tth bias of layer h^k. D_k is the number of units in the kth layer. The network assigns a probability to every possible data vector via this energy function. The probability of a training data vector can be raised by adjusting the weights and biases to lower the energy of that data vector and to raise the energy of similar, confabulated data that h^k would prefer to the real data. When we input the value of h^k, the network can learn the content of h^{k-1} by minimizing this energy function. The probability that the model assigns to h^{k-1} is

P(h^{k-1}; \theta) = \frac{1}{Z(\theta)} \sum_{h^k} \exp(-E(h^{k-1}, h^k; \theta))    (6)

Z(\theta) = \sum_{h^{k-1}} \sum_{h^k} \exp(-E(h^{k-1}, h^k; \theta))    (7)

where Z(\theta) denotes the normalizing constant. The conditional distributions over h^k and h^{k-1} are given as

p(h^k | h^{k-1}) = \prod_t p(h_t^k | h^{k-1})    (8)

p(h^{k-1} | h^k) = \prod_s p(h_s^{k-1} | h^k)    (9)

The probability of turning on unit t is a logistic function of the states of h^{k-1} and w_{st}^k:

p(h_t^k = 1 | h^{k-1}) = \mathrm{sigm}\Big(c_t^k + \sum_s w_{st}^k h_s^{k-1}\Big)    (10)

The probability of turning on unit s is a logistic function of the states of h^k and w_{st}^k:

p(h_s^{k-1} = 1 | h^k) = \mathrm{sigm}\Big(b_s^{k-1} + \sum_t w_{st}^k h_t^k\Big)    (11)

where the logistic function is

\mathrm{sigm}(\eta) = 1/(1 + e^{-\eta})    (12)

The derivative of the log-likelihood with respect to the model parameter w^k can be obtained from Eq. (6):

\frac{\partial \log p(h^{k-1})}{\partial w_{st}^k} = \langle h_s^{k-1} h_t^k \rangle_{P_0} - \langle h_s^{k-1} h_t^k \rangle_{P_{\mathrm{Model}}}    (13)

where \langle \cdot \rangle_{P_0} denotes an expectation with respect to the data distribution and \langle \cdot \rangle_{P_{\mathrm{Model}}} denotes an expectation with respect to the distribution defined by the model [46].

The expectation \langle \cdot \rangle_{P_{\mathrm{Model}}} cannot be computed analytically. In practice, \langle \cdot \rangle_{P_{\mathrm{Model}}} is replaced by \langle \cdot \rangle_{P_1}, which denotes a distribution of samples obtained when the feature detectors are driven by the reconstructed h^{k-1}. This is an approximation of the gradient of a different objective function, called contrastive divergence (CD) [47]:

\Delta w_{st}^k = \varepsilon (\langle h_s^{k-1} h_t^k \rangle_{P_0} - \langle h_s^{k-1} h_t^k \rangle_{P_1})    (14)

where \varepsilon is the learning rate. Then the parameter w_{st}^k can be adjusted through

w_{st}^k = \gamma\, w_{st}^k + \Delta w_{st}^k    (15)

where \gamma is the momentum.
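Eqs. (10)–(15) amount to one step of CD-1 training for the RBM between layers h^{k-1} and h^k. The sketch below assumes NumPy arrays and, following the convention of Hinton's DBN code that the paper adopts, applies the momentum to the weight increment rather than to the weight itself; the default learning rate 0.1 and initial momentum 0.5 mirror the settings reported in Section 5.1. This is an illustrative sketch, not the authors' released implementation.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, velocity, epsilon=0.1, gamma=0.5):
    """One CD-1 step for the RBM between h^{k-1} (visible, v0) and h^k (hidden).
    v0: (batch, D_{k-1}); W: (D_{k-1}, D_k); b, c: visible/hidden biases; velocity: momentum buffer."""
    # positive phase, Eq. (10)
    p_h0 = sigm(c + v0 @ W)
    h0 = (np.random.rand(*p_h0.shape) < p_h0).astype(float)
    # reconstruction, Eq. (11), then re-infer hidden probabilities
    p_v1 = sigm(b + h0 @ W.T)
    p_h1 = sigm(c + p_v1 @ W)
    # contrastive divergence gradient, Eqs. (13)-(14)
    grad_W = (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    # momentum-smoothed increment, cf. Eq. (15)
    velocity = gamma * velocity + epsilon * grad_W
    W = W + velocity
    b = b + epsilon * (v0 - p_v1).mean(axis=0)
    c = c + epsilon * (p_h0 - p_h1).mean(axis=0)
    return W, b, c, velocity
```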
The above discussion is based on training the parameters between two hidden layers with one sample review x. For unsupervised learning, we construct the deep architecture using all labeled reviews together with the unlabeled reviews by inputting them one by one from layer h^0, and train the parameters between h^0 and h^1. Then h^1 is constructed: the value of h^1 is calculated from h^0 and the trained parameters between h^0 and h^1. We can use it to construct the next layer h^2. The deep architecture is constructed layer by layer from bottom to top. Each time, the parameter space w^k is trained by the data calculated in the (k-1)th layer.

According to the w^k calculated above, the layer h^k is obtained as below for a sample x fed from layer h^0:

h_t^k(x) = \mathrm{sigm}\Big(c_t^k + \sum_{s=1}^{D_{k-1}} w_{st}^k h_s^{k-1}(x)\Big), \quad t = 1, \ldots, D_k, \ k = 1, \ldots, N-1    (16)

The parameter space w^N is initialized randomly, just as in the backpropagation algorithm. Then the ADN architecture is constructed. The top hidden layer is linear and is formulated as

h_t^N(x) = c_t^N + \sum_{s=1}^{D_{N-1}} w_{st}^N h_s^{N-1}(x), \quad t = 1, \ldots, D_N    (17)
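Putting Eqs. (14)–(17) together, the greedy layer-wise construction can be sketched as follows. The sketch reuses the cd1_update() function from the previous snippet, treats every entry of layer_sizes except the last as a sigmoid layer pre-trained as an RBM, and randomly initializes a final linear layer whose outputs play the role of h^N(x) in Eqs. (18)–(23); the exact assignment of layer widths is an assumption for illustration and not the paper's code.

```python
import numpy as np

def train_adn_stack(X, layer_sizes, epochs=30, epsilon=0.1):
    """Greedy layer-wise construction (Section 3.2): train an RBM between h^{k-1} and h^k
    on all labeled + unlabeled reviews, then feed its activations upward to the next RBM.
    X is (num_reviews, D); relies on the cd1_update() sketch above."""
    params, data = [], X
    for D_out in layer_sizes[:-1]:                      # sigmoid layers h^1 .. h^{N-1}
        D_in = data.shape[1]
        W = 0.01 * np.random.randn(D_in, D_out)
        b, c = np.zeros(D_in), np.zeros(D_out)
        velocity = np.zeros_like(W)
        for _ in range(epochs):
            W, b, c, velocity = cd1_update(data, W, b, c, velocity, epsilon)
        params.append((W, c))
        data = 1.0 / (1.0 + np.exp(-(c + data @ W)))    # Eq. (16): h^k(x)
    # top layer is linear and randomly initialized, Eq. (17)
    D_in, D_top = data.shape[1], layer_sizes[-1]
    params.append((0.01 * np.random.randn(D_in, D_top), np.zeros(D_top)))
    return params

def forward(params, X):
    """Compute h^N(x) for every review: sigmoid layers followed by the linear top layer."""
    h = X
    for W, c in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(c + h @ W)))
    W, c = params[-1]
    return c + h @ W                                    # Eq. (17)
```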
For supervised learning, the ADN architecture is trained with the L labeled data. The optimization problem is formulated as

\arg\min_{h^N} f(h^N(X_L), Y_L)    (18)

where

f(h^N(X_L), Y_L) = \sum_{i=1}^{L} \sum_{j=1}^{C} T(h_j^N(x^i)\, y_j^i)    (19)

and the loss function is defined as

T(r) = \exp(-r)    (20)

In the supervised learning stage, the stochastic activities are replaced by deterministic, real-valued probabilities. The greedy layer-wise unsupervised learning is only used to initialize the parameters of the deep architecture, which are updated based on Eq. (15). After initialization, real values are used in all the nodes of the deep architecture. We use gradient descent through the whole deep architecture to retrain the weights for optimal classification.
3.3. Active learning

Semi-supervised learning allows us to classify reviews with few labeled data. However, annotating the reviews manually is expensive, so we want to obtain higher performance with fewer labeled data. Active learning can help to choose the reviews that should be labeled manually in order to achieve higher classification performance with the same number of labeled data. For this purpose, we incorporate pool-based active learning into the ADN method, which has access to a pool of unlabeled instances and requests the labels for some number of them [16].

Given an unlabeled pool X^R and an initial labeled dataset X_L (one positive, one negative), the ADN architecture h^N decides which instance in X^R to query next. The parameters of h^N are then adjusted after new reviews are labeled and inserted into the labeled dataset. The main issue for an active learner is the choice of the next unlabeled instance to query. In this paper, we choose the reviews whose labels are most uncertain for the classifier. Drawing on previous work on active learning [3,16], we define the uncertainty of a review as the reciprocal of its distance from the separating hyperplane. In other words, reviews that are near the separating hyperplane are more uncertain than reviews that are farther away.

The deep architecture of ADN is first trained by all unlabeled data and the initial labeled training set with DBN-based semi-supervised learning, which has been introduced in Section 3.2. After semi-supervised learning, the parameters of ADN are adjusted. Given an unlabeled pool X^R, the next unlabeled instance to be queried is chosen according to the location of h^N(X^R). The distance between a point h^N(x^i) and the class separation line h_1^N = h_2^N is

d^i = |h_1^N(x^i) - h_2^N(x^i)| / \sqrt{2}    (21)

The selected training review to be labeled manually is given by

s = \{ j : d^j = \min d \}    (22)

We can select a group of the most uncertain reviews to label each time.

The experimental setting is similar to Dasgupta and Ng [3]. We perform active learning for five iterations and select 20 of the most uncertain reviews to be queried each time. Then the ADN is retrained on all labeled and unlabeled reviews so far with semi-supervised learning. At last, the label of a review x is determined according to the output h^N(x) of the ADN architecture as below:

y_j = \begin{cases} 1 & \text{if } h_j^N(x) = \max h^N(x) \\ -1 & \text{if } h_j^N(x) \ne \max h^N(x) \end{cases}    (23)

As shown by Tong and Koller [16], the balanced random method, which randomly samples an equal number of positive and negative instances from the pool, performs much better than the regular random method, so we incorporate this balance idea into the ADN method. However, it is not possible to choose an equal number of positive and negative instances without labeling the entire pool of instances in advance, so we present a simple way to approximate the balance of positive and negative reviews. In each iteration, we first count the numbers of positive and negative labeled reviews, respectively. Second, we classify the unlabeled reviews in the pool with the deep architecture trained in the previous iteration. Third, we choose an appropriate number of the positive and negative reviews labeled in the second step and add them into the labeled dataset, so that the numbers of positive and negative labeled reviews become equal. Fourth, we relabel all these newly added reviews manually to ensure the correctness of all review labels in the labeled dataset.
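The selection rule of Eqs. (21)–(23), combined with the balance approximation above, can be sketched as follows; the helper names and the even positive/negative split are our own simplification of the procedure described in the text.

```python
import numpy as np

def predict_labels(H):
    """Eq. (23): a review is assigned to the class whose output h_j^N(x) is largest (two classes here)."""
    return np.where(H[:, 0] >= H[:, 1], 1, -1)

def choose_uncertain_balanced(H_pool, n_query=20):
    """Eqs. (21)-(22) plus the balance heuristic of Section 3.3: rank pool reviews by
    distance to the separating line h_1^N = h_2^N and take the closest ones,
    half from each predicted class when possible.  H_pool is (U, 2)."""
    d = np.abs(H_pool[:, 0] - H_pool[:, 1]) / np.sqrt(2.0)   # Eq. (21)
    pred = predict_labels(H_pool)
    chosen = []
    for cls, quota in ((1, n_query // 2), (-1, n_query - n_query // 2)):
        idx = np.where(pred == cls)[0]
        idx = idx[np.argsort(d[idx])]                         # most uncertain first
        chosen.extend(idx[:quota].tolist())
    return chosen                                             # indices to label manually
```

In each active learning iteration the returned indices would be labeled manually, added to X_L, and the architecture re-trained, as described in Algorithm 1 below.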
3.4. ADN procedure

The procedure of ADN is shown in Algorithm 1. For the training of the ADN architecture, the parameters are randomly initialized from a normal distribution. All the training data and test data are used to train the ADN with unsupervised learning, which can be seen as transductive learning [48]. The training set X^R can be seen as an unlabeled pool. We randomly select one positive and one negative review in the pool as the initial labeled training set used for supervised learning. The numbers of units in the hidden layers D_1, ..., D_N and the number of epochs Q are set manually based on the dimension of the input data and the size of the training dataset. The number of iterations I and the number of actively chosen reviews per iteration G can be set based on the number of labeled reviews in the experiment.

In each iteration, the ADN architecture is first re-trained on all the unlabeled and labeled reviews with unsupervised learning and supervised learning; the parameters of the deep architecture are initialized with the training results of the previous iteration. Then we choose G reviews from the unlabeled pool based on the distance of these data from the separating line, label these reviews manually, and add them into the labeled dataset. For the next iteration, the unsupervised learning is initialized with the parameters trained in the supervised stage of the previous iteration, and then the supervised learning is applied again based on the new labeled dataset. The alternating unsupervised and supervised learning stages adjust the parameters with each other and improve the abstraction and classification ability of the deep architecture. At last, the ADN architecture is retrained with all the unlabeled reviews and the existing labeled reviews. After training, the ADN architecture is tested based on Eq. (23).

The proposed ADN method can actively choose the labeled dataset and classify the reviews with the same architecture, which avoids the barrier between the selection and training procedures that would arise with different architectures. More importantly, the parameters of ADN are trained iteratively during the labeled data selection process, which further improves the performance of ADN.

Algorithm 1. Active deep networks procedure.

Input: data X; X_L, Y_L (one positive and one negative review); number of layers N; number of epochs Q; number of training data R; number of test data T; parameter space W randomly initialized from a normal distribution; number of iterations I; number of actively chosen reviews per iteration G.
Output: deep architecture with parameter space W.

for i = 1; i <= I do
    Step 1. Greedy layer-wise unsupervised learning
    for n = 1; n <= N-1 do
        for q = 1; q <= Q do
            for k = 1; k <= R+T do
                Calculate the non-linear positive and negative phases:
                    p(h_t^k = 1 | h^{k-1}) = sigm(c_t^k + \sum_s w_{st}^k h_s^{k-1})
                    p(h_s^{k-1} = 1 | h^k) = sigm(b_s^{k-1} + \sum_t w_{st}^k h_t^k)
                Update the weights and biases:
                    \Delta w_{st}^k = \varepsilon(\langle h_s^{k-1} h_t^k \rangle_{P_0} - \langle h_s^{k-1} h_t^k \rangle_{P_1})
            end
        end
    end
    Step 2. Supervised learning with gradient descent
        Minimize f(h^N(X), Y) on the labeled dataset X_L, updating the parameter space W according to \arg\min_{h^N} f(h^N(X_L), Y_L)
    Step 3. Choose instances for the labeled dataset
        Choose the G instances nearest the separating line by s = {j : d^j = min d}
        Add the G instances into the labeled dataset X_L
end
Train ADN with Steps 1 and 2.

4. Information ADN

In this part, we combine the information density idea [19] with ADN and propose a novel information ADN (IADN) method for semi-supervised sentiment classification.

The proposed ADN method can actively choose the reviews that are near the separating hyperplane as the training data to be labeled manually. However, ADN does not consider the information density of these review candidates. For example, in Fig. 2, the samples A and B are two labeled examples and the other circles are unlabeled data. Since C is the sample nearest to the decision boundary, it should be chosen by the ADN method. However, C is far from the centers of the two classes, i.e., it is not a representative sample of the distribution. In this case, querying D is likely to contribute more information about the dataset. The IADN method is proposed to take this observation into consideration.

Fig. 2. Illustration of information density idea.

When the deep architecture has been trained with the L labeled data and all unlabeled data, the parameters are adapted and h^N(x) is used to represent the sample x. Given an unlabeled pool X^R, the next unlabeled instance to be queried is chosen according to the location of h^N(X^R). The informativeness of h^N(x) is weighted by its average similarity to the other samples on the same side of the separation line as h^N(x). It is formalized as

ID^i = d^i \times \Big( \frac{1}{U-1} \sum_{j=1, j \ne i}^{U} \mathrm{dis}(h^N(x^i), h^N(x^j)) \Big)^{\beta}    (24)

where

X^U = \{ j : x^j \in X^R \ \wedge\ (h_1^N(x) - h_2^N(x)) \cdot (h_1^N(x^j) - h_2^N(x^j)) > 0 \}    (25)

indicates the unlabeled instances that belong to the same class as x based on the classification result of the currently trained classifier,

\mathrm{dis}(h^N(x^i), h^N(x^j)) = |h_1^N(x^i) - h_1^N(x^j)| + |h_2^N(x^i) - h_2^N(x^j)|    (26)

denotes the distance between h^N(x^i) and h^N(x^j), d^i denotes the distance between a point h^N(x^i) and the separation line as defined by Eq. (21), and \beta controls the relative importance of the density term. The training reviews that should be labeled manually are given by

s = \{ x^i : ID^i = \min ID \}    (27)
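A sketch of the IADN query rule of Eqs. (24)–(27) is given below; it assumes the pool outputs h^N(x) are available as a NumPy array and uses illustrative names rather than the authors' implementation.

```python
import numpy as np

def choose_by_information_density(H_pool, n_query=20, beta=1.0):
    """Eqs. (24)-(27): weight the uncertainty of each pool review by its average
    distance (Eq. (26)) to the other reviews on the same side of the separating
    line, and query the reviews with the smallest ID score.  H_pool is (U, 2)."""
    d = np.abs(H_pool[:, 0] - H_pool[:, 1]) / np.sqrt(2.0)     # Eq. (21)
    side = np.sign(H_pool[:, 0] - H_pool[:, 1])                 # same-side test of Eq. (25)
    scores = np.empty(len(H_pool))
    for i in range(len(H_pool)):
        same = np.where((side == side[i]) & (np.arange(len(H_pool)) != i))[0]
        # Eq. (26): component-wise absolute distance between h^N(x^i) and h^N(x^j)
        dis = np.abs(H_pool[same] - H_pool[i]).sum(axis=1)
        density = dis.mean() if len(dis) else 0.0
        scores[i] = d[i] * density ** beta                      # Eq. (24)
    return np.argsort(scores)[:n_query]                         # Eq. (27): smallest ID first
```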
The balance selection procedure in ADN does not consider the case where there are not enough positive (or negative) reviews for selection; it selects the reviews randomly in that case. For the IADN method, we select all remaining positive (or negative) reviews and fill the gap with remaining negative (or positive) reviews in this case. The density calculation has to rely on the labels predicted by the classifier, although these may be misleading, because we do not know the labels of the reviews in the pool. The classifier can recognize most of the reviews correctly, and even though some reviews near the separation line are recognized wrongly, this has little effect on the density calculation, because such wrongly recognized reviews are close to both classes of reviews at the same time.

The ADN and IADN architectures have different numbers of hidden units in each hidden layer. The width of the deep architecture used for a dataset is set based on the scale of the dataset and the dimension of the input data. If the scale of the dataset and the dimension of the input data increase, the number of hidden units in each hidden layer increases too, because the larger number of parameters in a large deep architecture needs more high-dimensional training data to train the architecture effectively. Moreover, the number of units in the last hidden layer is larger than in the other hidden layers, because the last hidden layer is linear while the other hidden layers are non-linear, and more units are needed in the linear layer to represent a model.

5. Experiments

We conduct several experiments to compare the performance of ADN and IADN with that of existing methods. We have the following questions in mind while designing and conducting the experiments:

1. How do ADN and IADN perform when compared with other state-of-the-art semi-supervised learning methods for sentiment classification?
2. How do ADN and IADN perform when compared with the semi-supervised learning method based on our proposed deep architecture?
3. How does information density perform when there are few labeled data?
4. How does the deep architecture perform with different loss functions?
5. How does varying the number of labeled reviews affect the performance of ADN and IADN?
6. How does varying the number of unlabeled reviews affect the performance of ADN and IADN?

These questions are answered in the following subsections: question 1 in Section 5.2, question 2 in Section 5.3, question 3 in Section 5.4, question 4 in Section 5.5, question 5 in Section 5.6, and question 6 in Section 5.7.

5.1. Experimental setup

The performance of the proposed ADN and IADN methods is evaluated on five sentiment classification datasets. The first dataset is MOV [1], which is a widely-used movie review dataset. The other four datasets contain reviews of four types of products: books (BOO), DVDs (DVD), electronics (ELE), and kitchen appliances (KIT), respectively [2,3]. Each dataset contains 1000 positive and 1000 negative reviews.

For the MOV dataset, the ADN and IADN structure used in this experiment is 100-100-200-2, which means that the number of units in the output layer is 2 and the numbers of units in the 3 hidden layers are 100, 100, and 200, respectively. For the other four datasets, the ADN and IADN structure used in this experiment is 50-50-200-2. The number of units in the input layer equals the dimension of each dataset. The number of units in the third hidden layer is larger than in the previous two hidden layers, because the units of the third hidden layer are linear and more units can improve the representation ability of that layer. As the size of the vocabulary in the MOV dataset is larger than in the other four datasets, the number of units in the first two hidden layers for the MOV dataset is larger than for the other four datasets. The architecture of ADN and IADN is similar to DBN, but with a different loss function introduced for the supervised learning stage. The parameters of the deep architecture are fixed as the default parameter settings of Hinton's DBN package [14]. For greedy layer-wise unsupervised learning, we train the weights of each layer independently with 30 epochs and the learning rate set to 0.1. The initial momentum is 0.5 and after five epochs the momentum is set to 0.9. For supervised learning, we run 10 epochs, and three line searches are performed in each epoch.

We compare the classification performance of ADN and IADN with six representative classifiers, i.e., semi-supervised spectral learning (Spectral) [49], transductive SVM (TSVM), active learning (Active) [16], mine the easy classify the hard (MECH) [3], deep belief networks (DBN) [14], and recursive autoencoders (RAE) [23]. Spectral learning, TSVM, and active learning are three baseline methods for sentiment classification. Spectral learning incorporates labeled data into the clustering framework in the form of must-link and cannot-link constraints. TSVM is the semi-supervised learning version of SVM. Active learning is implemented based on SVM: an inductive SVM is trained on one labeled review from each class, the most uncertain unlabeled reviews are iteratively labeled, and the SVM is re-trained until 100 reviews are labeled. MECH is a recent semi-supervised method for sentiment classification [3], which first mines the unambiguous reviews using spectral techniques, and then exploits them to classify the ambiguous reviews via a novel combination of active learning, transductive learning, and ensemble learning. The implementation details of the Spectral, TSVM, Active, and MECH methods are introduced by Dasgupta and Ng [3]. DBN is a classical deep learning method proposed recently [14]. The parameters of the DBN in our experiment are fixed as the default parameter settings of Hinton's DBN package [14], and the minimum mean squared error loss function is used in the DBN architecture. RAE learns vector space representations for multi-word phrases based on recursive autoencoders. The implementation details of RAE are introduced by Socher et al. [23].

5.2. ADN performance

To compare the performance of ADN and IADN with previous works, similar to Dasgupta and Ng [3], we randomly divide the 2000 reviews into 10 folds and test all the algorithms using cross-validation. All reviews are used as unlabeled data, where 1000 are used for training and the other 1000 for test. In each fold, 100 reviews are randomly selected as training data and the remaining 100 reviews are used for test. Because of the randomness involved in the choice of labeled data, all the results of the Spectral, TSVM, and DBN methods are acquired by repeating 10 times for each fold and then averaging over the results. For the Active, MECH, ADN, and IADN methods, one positive and one negative review are selected for the initialization of active learning, and 100 labeled reviews are chosen from the training dataset by active learning and used for training the classifier. For the Active, MECH, ADN, and IADN methods, the active learning is performed for five iterations. In each iteration, 20 of the most uncertain points are selected and labeled, and then the classifier is retrained on all of the unlabeled reviews and the labeled reviews annotated so far. After five iterations, 100 labeled reviews are used for training. For these active learning methods, the initial two labeled reviews are selected randomly, so we repeat 30 times for each method and average the results. In summary, for the Spectral, TSVM, DBN, and RAE methods, 100 labeled reviews are selected randomly; for the Active, MECH, ADN, and IADN methods, 100 labeled reviews are selected by active learning, with only the first two labeled reviews selected randomly.

The classification accuracies on test data in cross validation for the five datasets and eight methods are shown in Table 1. The results of the first four methods are those reported by Dasgupta and Ng [3]. The structure and parameters used for DBN are the same as for ADN and IADN in this experiment. The experiment with RAE is based on the default parameters of the source code provided by Socher et al. [23]. From Table 1, we can see that the performance of DBN is competitive with MECH. Since MECH is a combination of spectral clustering, TSVM and active learning, while DBN is just a classification method based on a deep neural network, this result shows the good learning ability of the deep architecture. ADN is a combination of semi-supervised learning and active learning based on the deep architecture, and the performance of ADN is better than the previous six methods on all five datasets. This could be attributed to the following: First, ADN uses a deep architecture to guide the output vectors of samples belonging to different regions of the new Euclidean space, which can abstract useful information that is not accessible to other learners. Second, ADN uses an exponential loss function to maximize the separability of labeled reviews in the global refinement for better discriminability. Third, ADN fully exploits the embedding information from the large amount of unlabeled reviews to improve the robustness of the classifier. Fourth, ADN chooses the useful training reviews actively, which also improves the classification performance. The performance of IADN is better than the previous seven methods. This is because the semi-supervised learning method used in IADN is the same as in ADN,
and the active learning method used in IADN further improves the classification performance.

Table 1
Test accuracy with 100 labeled reviews for five datasets and eight methods.

Type      MOV    KIT    ELE    BOO    DVD
Spectral  67.3   63.7   57.7   55.8   56.2
TSVM      68.7   65.5   62.9   58.7   57.3
Active    68.9   68.1   63.3   58.6   58.0
MECH      76.2   74.1   70.6   62.1   62.7
DBN       71.3   72.6   73.6   64.3   66.7
RAE       66.3   69.4   68.2   61.3   63.1
ADN       76.3   77.5   76.8   69.0   71.6
IADN      76.4   78.2   77.9   69.7   72.2

5.3. Effect of active learning

To evaluate the contribution of active learning in the proposed methods, we conduct the following experiments. The architectures used in this section are the same as in Section 5.2.

Passive learning: We randomly select 100 reviews from the training fold and use them as labeled data. Then the semi-supervised learning method used in ADN is applied to train and test the performance; this is called ADN with passive learning (or passive learning for short) and is denoted as Pas. in Fig. 3. The experiment is run 10 times for each fold, and the average is taken over all results. The test accuracies of DBN, ADN with passive learning, and ADN on the five datasets are shown in Fig. 3. Compared with DBN, the mean accuracy over the five datasets for ADN with passive learning is improved from 69.7% to 71.2%, which is contributed by the exponential loss function used in the ADN architecture. Compared with the ADN method, the mean accuracy over the five datasets for ADN with passive learning is reduced from 74.2% to 71.2%, which proves the effectiveness of the active learning.

Fully supervised learning: We train a fully supervised classifier using all 1000 training reviews based on the ADN architecture, which is called ADN with supervised learning and is denoted by Sup. in Fig. 3. The test accuracies of ADN and ADN with supervised learning on the five datasets are also shown in Fig. 3. Compared with the ADN method, we can see that employing only 100 active learning points enables us to reach nearly the same performance as the fully-supervised method on three datasets. Compared with the ADN method, the mean accuracy over the five datasets for ADN with supervised learning is improved only from 74.2% to 75.4%, while the number of required labeled data has been increased from 100 to 1000.

Fig. 3. Test accuracy of DBN and ADN with different experiment settings on five datasets.

Performance curve: We use the KIT dataset to test the performance curve of ADN and IADN over the iterations of active learning; the results are shown in Fig. 4. From the figure, we can see that the performance of IADN is much better than ADN, especially in the earlier iterations. This proves the effect of the information density method. Moreover, with the iterations of active learning, ADN and IADN converge quickly.

Fig. 4. Performance curve of ADN and IADN with iterations of active learning (accuracy versus number of iterations).

5.4. Effect of information density

To evaluate the contribution of the information density idea in the proposed IADN method, we conduct the following experiments. The architectures used in this section are the same as in Section 5.2. Different from the previous experiments, for the active learning of ADN and IADN, two of the most uncertain points are selected and labeled in each iteration; after five iterations, 10 labeled reviews are used for training.

The test accuracies of ADN and IADN with 10 labeled reviews on the five datasets are shown in Fig. 5. We can see that the performance of IADN is better than ADN on all five datasets. Because only the two most uncertain points are selected and labeled in each iteration, the wrong selection of any points can make the performance of the classifier worse, so this experimental setting emphasizes the effect of the information density idea.

Fig. 5. Test accuracy of ADN and IADN with 10 labeled reviews on five datasets.

5.5. Effect of loss function

In the ADN and IADN architectures, we use the exponential loss function to replace the squared error loss function of the classical DBN architecture. Another popular type of loss function is the hinge loss function used in SVM. A detailed analysis of these loss functions can be found in [50]. In this part, we just experimentally evaluate the performance of these loss functions for sentiment classification.
The test accuracies of the deep architecture with different loss functions on the five datasets are shown in Fig. 6. The results show that the exponential loss function reaches the best performance on all sentiment datasets, and the differences against the second best method are statistically significant (p < 0.05) under the paired t-test for all five datasets. The performance of the squared error loss function is competitive with the hinge loss function. This proves the effectiveness of the exponential loss function used in the ADN and IADN architectures.

Fig. 6. Test accuracy of the deep architecture with different loss functions on five datasets.

5.6. Semi-supervised learning with variance of labeled data

To verify the performance of ADN and IADN with different numbers of labeled data, we conduct another series of experiments on the five datasets and show the results in Fig. 7. The architectures for ADN and IADN used in this experiment are the same as in Section 5.2. For both the ADN and IADN methods, we repeat 30 times for each experimental setting, and the results are averaged.

Fig. 7 shows that ADN and IADN can reach a relatively high accuracy using just 20 labeled reviews for training. For most of the five sentiment datasets, the test accuracies increase slowly while the number of labeled reviews grows. It also shows that the performance of ADN is competitive with IADN. On the MOV and ELE datasets, the performance of ADN is even better than IADN: because there are few abnormal reviews in these two datasets, the information density restriction does not take effect in these two experiments. On the other three datasets, the performance of IADN is better than ADN.

Fig. 7. Test accuracy of ADN and IADN with different numbers of labeled reviews on five datasets.

5.7. Semi-supervised learning with variance of unlabeled data

To verify the contribution of unlabeled reviews to the ADN and IADN methods, we conduct several experiments with different numbers of unlabeled reviews and 100 labeled reviews. In these experiments, 1000 reviews are used as training data; ADN and IADN can select 100 reviews actively and label these reviews for supervised learning. Compared with the experiments in Section 5.2, we just reduce the number of unlabeled data in the unsupervised learning stage. The architectures for ADN and IADN used here are also the same as in Section 5.2. For both the ADN and IADN methods, we repeat 30 times for each experimental setting, and the results are averaged.

The test accuracies of ADN and IADN with different numbers of unlabeled reviews and 100 labeled reviews on the five datasets are shown in Fig. 8. We can see that ADN and IADN perform well when
S. Zhou et al. / Neurocomputing 120 (2013) 536546 545

just using 800 unlabeled reviews. When the number of unlabeled [5] M. Gamon, Sentiment classication on customer feedback data: noisy data,
reviews is reduced from 2000 to 800, the performance of ADN and large feature vectors, and the role of linguistic analysis, in: International
Conference on Computational Linguistics, Association for Computational
IADN is not worse. For DVD dataset, the performance of ADN and Linguistics, Switzerland, 2004, pp. 841847.
IADN which use 800 unlabeled reviews is better than them using [6] T. Li, Y. Zhang, V. Sindhwani, A non-negative matrix tri-factorization approach
2000 unlabeled reviews. When the number of unlabeled reviews to sentiment classication with lexical prior knowledge, in: Joint Conference
of the 47th Annual Meeting of the Association for Computational Linguistics
is reduced from 800 to 100, the performance of ADN and IADN get and 4th International Joint Conference on Natural Language Processing of the
worse quickly. This proves that ADN and IADN can get competitive Asian Federation of Natural Language Processing, Association for Computa-
performance with just few labeled reviews and appropriate tional Linguistics, Singapore, 2009, pp. 244252.
[7] R. Raina, A. Battle, H. Lee, B. Packer, A. Y. Ng, Self-taught learning: transfer
number of unlabeled reviews. Inclusion of unlabeled data does
learning from unlabeled data, in: International Conference on Machine
always improve the performance, however, if there are enough learning, ACM, Corvallis, Oregon, USA, 2007, pp. 759766.
unlabeled data, add more unlabeled data just add much time [8] X. Zhu, Semi-supervised Learning Literature Survey, Technical Report, Uni-
needed for training, the performance will not improve signi- versity of Wisconsin Madison, Madison, WI, USA, 2007.
[9] M. Wang, X.-S. Hua, T. Mei, R. Hong, G. Qi, Y. Song, L.-R. Dai, Semi-supervised
cantly. Considering the much time needed for training with more kernel density estimation for video annotation, Comput. Vis. Image Under-
unlabeled reviews and less accuracy improved for ADN and IADN standing 113 (2009) 384396.
method, we suggest using appropriate number of unlabeled [10] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, Y. Song, Unied video annotation via
multigraph learning, IEEE Trans. Circuits Syst. Video Technol. 19 (2009) 733746.
reviews in real application. In this experiment, the performance [11] Z. Zha, T. Mei, J. Wang, Z. Wang, X. Hua, Graph-based semi-supervised learning
of ADN is competitive with IADN too. with multiple labels, J. Visual Commun. Image Representation 20 (2009)
97103.
[12] Y. Bengio, Learning Deep Architectures for AI, Technical Report, IRO, Universite
de Montreal, 2007.
6. Conclusions [13] R. Salakhutdinov, G.E. Hinton, Learning a nonlinear embedding by preserving
class neighbourhood structure, J. Mach. Learn. Res. 2 (2007) 412419.
This paper proposes a novel semi-supervised learning algorithm, ADN, to address the sentiment classification problem with a small number of labeled reviews. ADN can choose the proper training reviews to be labeled manually, and fully exploits the embedding information in the large amount of unlabeled reviews to improve the robustness of the classifier. We propose a new architecture that guides the output vectors of samples into different regions of a new Euclidean space, and use an exponential loss function to maximize the separability of labeled reviews during global refinement for better discriminability. Moreover, we also propose the IADN method, which takes the information density of different reviews into consideration when choosing reviews to be labeled manually.
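To make these two ingredients concrete, the following minimal sketch shows one way they could look: an exponential loss over the real-valued outputs of labeled reviews (assumed form exp(-y * f(x)) with labels in {-1, +1}), and an IADN-style selection score that multiplies an uncertainty term by an information-density term based on average cosine similarity. The function names, the loss form, and the density measure are illustrative assumptions rather than the paper's exact formulation.

import numpy as np

def exponential_loss(outputs, labels):
    # Assumed loss used during global refinement: sum_i exp(-y_i * f(x_i)),
    # where outputs are real-valued network scores and labels are in {-1, +1}.
    return np.sum(np.exp(-labels * outputs))

def information_density(features):
    # Average cosine similarity of each unlabeled review to the rest of the
    # pool; denser reviews are more representative of the unlabeled data.
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = X @ X.T
    n = sim.shape[0]
    return (sim.sum(axis=1) - 1.0) / max(n - 1, 1)

def select_reviews_to_label(outputs, features, k=10, beta=1.0):
    # IADN-style choice: prefer reviews that are both uncertain (output close
    # to the decision boundary) and informative (high information density).
    uncertainty = 1.0 / (np.abs(outputs) + 1e-12)
    score = uncertainty * information_density(features) ** beta
    return np.argsort(-score)[:k]

In each round, the k selected reviews would be labeled manually, added to the training set, and the network refined again, which is the role active learning plays in ADN and IADN.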
The performance of ADN and IADN is compared with existing semi-supervised learning methods and deep learning techniques. Experimental results show that both ADN and IADN achieve better performance than the compared methods. We also conduct experiments to verify the effectiveness of ADN and IADN under different numbers of labeled and unlabeled reviews, and demonstrate that ADN and IADN can obtain competitive classification performance using only a few labeled reviews and an appropriate number of unlabeled reviews.
Acknowledgments on Empirical Methods in Natural Language Processing, Association for
Computational Linguistics, Edinburgh, UK, 2011, pp. 151161.
[24] X. Wan, Co-training for cross-lingual sentiment classication, in: Joint Con-
This work is supported in part by the Scientific Research Fund of Ludong University (LY2013004) and the National Natural Science Foundation of China (Nos. 61173075 and 60973076).
Shusen Zhou received the Ph.D. degree in computer application technology from the Harbin Institute of Technology in 2012. He is currently an assistant professor in Ludong University. His main research interests include machine learning, artificial intelligence, multimedia content analysis and computational linguistics.

Qingcai Chen received the Ph.D. degree in computer science from the Computer Science and Engineering Department, Harbin Institute of Technology. From September 2003 to August 2004, he worked for Intel (China) Ltd. as a senior software engineer. Since September 2004, he has been with the Computer Science and Technology Department of Harbin Institute of Technology Shenzhen Graduate School as an associate professor. His research interests include machine learning, pattern recognition, speech signal processing, and natural language processing.

Xiaolong Wang received the B.E. degree in computer science from the Harbin Institute of Electrical Technology, Harbin, China, in 1982, the M.E. degree in computer architecture from Tianjin University, Tianjin, China, in 1984, and the Ph.D. degree in computer science and engineering from the Harbin Institute of Technology in 1989. He was an Assistant Lecturer in 1984 and an Associate Professor in 1990 with the Harbin Institute of Technology. From 1998 to 2000, he was a Senior Research Fellow with the Department of Computing, Hong Kong Polytechnic University, Kowloon. He is currently a Professor of computer science with the Harbin Institute of Technology Shenzhen Graduate School. His research interest includes artificial intelligence, machine learning, computational linguistics, and Chinese information processing.