
Semi-supervised Multi-label Learning by Solving a Sylvester Equation

Gang Chen    Yangqiu Song∗    Fei Wang∗    Changshui Zhang∗

∗ State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, P. R. China, {g-c05, songyq99, feiwang03}@mails.thu.edu.cn, [email protected]

Abstract

Multi-label learning refers to problems where an instance can be assigned to more than one category. In this paper, we present a novel Semi-supervised algorithm for Multi-label learning by solving a Sylvester Equation (SMSE). Two graphs are first constructed on the instance level and the category level respectively. On the instance level, a graph is defined based on both labeled and unlabeled instances, where each node represents one instance and each edge weight reflects the similarity between the corresponding pair of instances. Similarly, on the category level, a graph is built based on all the categories, where each node represents one category and each edge weight reflects the similarity between the corresponding pair of categories. A regularization framework combining two regularization terms for the two graphs is suggested. The regularization term for the instance graph measures the smoothness of the labels of instances, and the regularization term for the category graph measures the smoothness of the labels of categories. We show that the labels of unlabeled data can finally be obtained by solving a Sylvester Equation. Experiments on the RCV1 data set show that SMSE can make full use of the unlabeled data information as well as the correlations among categories and achieves good performance. In addition, we give an extended application of SMSE to collaborative filtering.

Keywords

Multi-label learning, Graph-based semi-supervised learning, Sylvester equation, Collaborative filtering

1 Introduction

Many learning problems require each instance to be assigned to multiple different categories, which are generally called multi-label learning problems. Multi-label learning problems arise in many practical applications such as automatic image annotation and text categorization. For example, in automatic image annotation, an image can be annotated as “road” as well as “car”, where the terms “road” and “car” are different categories. Similarly, in text categorization, each document usually has different topics (e.g. “politics”, “economy” and “military”), where different topics are different categories.

The most common approach toward multi-label learning is to decompose it into multiple independent binary classification problems, one for each category. The final labels for each instance can be determined by combining the classification results from all the binary classifiers. The advantage of this method is that many state-of-the-art binary classifiers can be readily used to build a multi-label learning machine. However, this approach ignores the underlying mutual correlations among different categories, which usually do exist in practice and can contribute significantly to classification performance. Zhu et al. [35] give an example illustrating the importance of considering the category correlations. To take the dependencies among categories into account, a straightforward approach is to transform the multi-label learning problem into a set of binary classification problems where each possible combination of categories, rather than each category, is regarded as a new class. In other words, a multi-label learning problem with n different categories would be converted into 2^n − 1 binary classification problems, where each class corresponds to a possible combination of the original categories. However, this approach has two serious drawbacks. First, when the number of original categories is large, the number of combined classes, which grows exponentially, becomes too large to be tractable. Second, when there are very few instances in many combined classes, a data sparsity problem occurs. So this approach is limited to a relatively small number of categories and assumes that the amount of training data is sufficient for training each binary classifier. In the past years, many novel multi-label learning algorithms modeling the correlations among categories have been developed [3, 7, 9–11, 14, 16, 17, 21, 22, 28–30, 33, 35], some of which will be introduced briefly in Section 2.

In this paper, we present a novel Semi-supervised Multi-label learning framework by solving a Sylvester Equation (SMSE). Two graphs are first constructed on

the instance level and the category level respectively. On the instance level, a graph is defined based on both labeled and unlabeled instances, where each node represents one instance and each edge weight reflects the similarity between the corresponding pair of instances. Similarly, on the category level, a graph is built based on all the categories, where each node represents one category and each edge reflects the similarity between the corresponding pair of categories. Then we define a quadratic energy function on each graph, and the labels of unlabeled data can be inferred by minimizing a combination of the two energy functions that balances the two energy terms. Here, the correlations among different categories are taken into account via the energy function for the category graph. In fact, our algorithm can be viewed as a regularization framework including two regularization terms corresponding to the two energy functions respectively. The regularization term for the instance graph measures the smoothness of the labels of instances, and the regularization term for the category graph measures the smoothness of the labels of categories. Finally, the labels of unlabeled instances can be obtained by solving a Sylvester Equation.

The rest of this paper is organized as follows. We first give a brief summary of related work on multi-label learning in Section 2. Section 3 describes our semi-supervised multi-label learning algorithm. In Section 4, we discuss our algorithm's relationship with spectral clustering. The data and experimental results are presented in Section 5. Section 6 presents our algorithm's extended application to collaborative filtering, followed by our conclusions in Section 7.

2 Related Work

As discussed in Section 1, the simplest method for multi-label learning is to divide it into a set of binary classification problems, one for each category [6, 15, 32]. This approach suffers from a number of disadvantages. One disadvantage is that it cannot scale to a large number of categories, since a binary classifier has to be built for each category. Another disadvantage is that it does not exploit the correlations among different categories, because each category is treated independently. Finally, this approach may face a severe unbalanced data problem, especially when the number of categories is large: the number of “negative” instances for each binary classification problem can be much larger than the number of “positive” instances. Consequently, the binary classifiers are likely to output “negative” labels for most instances, including the “positive” ones.

Another direction toward multi-label learning is label ranking [7–9, 25]. These approaches learn a ranking function of category labels from the labeled instances and apply it to classify each unknown test instance by choosing all the categories with scores above a given threshold. Compared with the binary classification approach above, the label ranking approaches are more appropriate for dealing with a large number of categories, because only one ranking function needs to be learned to compare the relevance of individual category labels with respect to test instances. The label ranking approaches also avoid the unbalanced data problem, since they do not make binary decisions on category labels. Although the label ranking approaches provide a unique way to handle the multi-label learning problem, they do not exploit the correlations among data categories either.

Recently, more and more approaches for multi-label learning that consider the correlations among categories have been developed. Ueda et al. [30] suggest a generative model which incorporates the pairwise correlation between any two categories into multi-label learning. Griffiths et al. [12] propose a Bayesian model to determine instance labels via underlying latent representations. Zhu et al. [35] employ a maximum entropy method for multi-label learning to model the correlations among categories. McCallum [22] and Yu et al. [33] apply approaches based on latent variables to capture the correlations among different categories. Cai et al. [5] and Rousu et al. [23] assume a hierarchical structure among the category labels to handle the correlation information among categories. Kang et al. [16] give a correlated label propagation framework for multi-label learning that explicitly exploits the correlations among categories. Unlike the previous work that only considers the correlations among different categories, Liu et al. [21] present a semi-supervised multi-label learning method based on constrained non-negative matrix factorization, which exploits unlabeled data as well as category correlations. Generally, in comparison with supervised methods, semi-supervised methods can effectively make use of the information provided by unlabeled instances, and are superior particularly when the amount of training data is relatively small. In this paper, we propose a novel semi-supervised approach for multi-label learning different from [21].

3 Semi-supervised Multi-label Learning by Solving a Sylvester Equation

We will first introduce some notations that will be used throughout the paper. Suppose there are l labeled instances (x_1, y_1), · · · , (x_l, y_l) and u unlabeled instances x_{l+1}, · · · , x_{l+u}, where each x_i = (x_{i1}, · · · , x_{im})^T is an m-dimensional feature vector and each y_i = (y_{i1}, · · · , y_{ik})^T is a k-dimensional label vector. Here, we assume

the label of each instance for each category is binary: y_{ij} ∈ {0, 1}. Let n = l + u be the total number of instances, X = (x_1, · · · , x_n)^T and Y = (y_1, · · · , y_n)^T = (c_1, · · · , c_k).

3.1 Background  Our work is related to semi-supervised learning, for which Seeger [26] and Zhu [36] each give a detailed survey. In order to make our work more self-contained, we will introduce Zhu et al.'s graph-based semi-supervised learning algorithm [37].

Consider a connected graph G = (V, E) with nodes corresponding to the n instances, where nodes L = {1, · · · , l} correspond to the labeled instances with labels y_1, · · · , y_l, and nodes U = {l + 1, · · · , l + u} correspond to the unlabeled instances. The objective is to predict the labels of the nodes in U. We define an n × n symmetric weight matrix W on the edges of the graph as follows:

(3.1)  W_{ij} = exp( − \sum_{d=1}^{m} (x_{id} − x_{jd})^2 / σ_d^2 )

where σ_1, · · · , σ_m are length scale hyperparameters for each dimension. Thus, the nearer two nodes are, the larger the corresponding edge weight is. To reduce parameter tuning work, we generally suppose σ_1 = · · · = σ_m.

Define a real-valued function f : V → R that determines the labels of the unlabeled instances. We constrain f to satisfy f_i = y_i (i = 1, · · · , l). Assume that nearby points on the graph are likely to have similar labels, which motivates the choice of the quadratic energy function

(3.2)  E(f) = (1/2) \sum_{i,j=1}^{n} W_{ij} ||f_i − f_j||^2

By minimizing the above energy function, the soft labels of the unlabeled instances can be computed. The optimization problem can be summarized as follows [37]:

(3.3)  min  ∞ \sum_{i=1}^{l} ||f_i − y_i||^2 + (1/2) \sum_{i,j=1}^{n} W_{ij} ||f_i − f_j||^2

Essentially, the energy function here acts as a regularization term that measures the smoothness of the labels of instances. Zhou et al. [34] give a similar semi-supervised learning algorithm, which can be described as the following optimization problem:

(3.4)  min  µ \sum_{i=1}^{n} ||f_i − y_i||^2 + (1/2) \sum_{i,j=1}^{n} W_{ij} ||f_i / \sqrt{d_i} − f_j / \sqrt{d_j}||^2

where µ is a positive constant, d_i = \sum_{j=1}^{n} W_{ij}, and y_i = 0 (i = l + 1, · · · , n). Furthermore, Belkin et al. [2] propose a unified regularization framework for semi-supervised learning by introducing an additional regularization term in a Reproducing Kernel Hilbert Space (RKHS).

3.2 Our Basic Framework  Traditional graph-based semi-supervised methods only construct a graph on the instance level, which is appropriate when there are no correlations among categories. However, category correlations often exist in a typical multi-label learning scenario. Therefore, in order to make use of the correlation information, we construct another graph on the category level as well. Let G' = (V', E') denote the category graph with k nodes, where each node represents one category. We define a k × k symmetric weight matrix W' by the following formula:

(3.5)  W'_{ij} = exp( −λ (1 − cos(c_i, c_j)) )

where λ is a hyperparameter, c_i is a binary vector whose elements are set to one when the corresponding training instances belong to the ith category and zero otherwise (please refer to the notation c_i at the beginning of Section 3), and cos(c_i, c_j) is the cosine similarity between c_i and c_j:

(3.6)  cos(c_i, c_j) = ⟨c_i, c_j⟩ / (||c_i|| ||c_j||)

Define F = (f_1, · · · , f_n)^T = (g_1, · · · , g_k). We can then also obtain a quadratic energy function for the category graph:

(3.7)  E'(g) = (1/2) \sum_{i,j=1}^{k} W'_{ij} ||g_i − g_j||^2

This can also be viewed as a regularization term that measures the smoothness of the labels of categories. If we incorporate the regularization term for the category graph into Eq. (3.3), the category correlation information can be used effectively. This encourages us to propose the following graph-based semi-supervised algorithm for multi-label learning, i.e. SMSE1:

(3.8)  min  ∞ \sum_{i=1}^{l} ||f_i − y_i||^2 + µ E(f) + ν E'(g)

where µ and ν are nonnegative constants that balance E(f) and E'(g).
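As a concrete illustration of the graph construction described above, the following sketch computes the two weight matrices of Eqs. (3.1) and (3.5)-(3.6) with NumPy. It is our own minimal example rather than the authors' code, and the names X, Y_labeled, sigma2 and lam are placeholders:

```python
import numpy as np

def instance_weights(X, sigma2):
    """Instance-graph weights of Eq. (3.1), using a single shared length scale
    sigma2 (i.e. sigma_1 = ... = sigma_m), as suggested in the paper."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / sigma2)

def category_weights(Y_labeled, lam):
    """Category-graph weights of Eqs. (3.5)-(3.6): W'_ij = exp(-lam*(1 - cos(c_i, c_j))),
    where c_i is the 0/1 indicator vector of category i over the labeled instances."""
    C = Y_labeled.T.astype(float)                    # k x l, row i is c_i
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                          # guard against empty categories
    cos = (C @ C.T) / (norms * norms.T)
    return np.exp(-lam * (1.0 - cos))
```

In practice one would typically sparsify W by keeping only k-nearest-neighbor edges, as is done in the experiments of Section 5.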

By solving Eq. (3.8), we can obtain the soft labels of the unlabeled instances, which in fact provide a ranked list of category labels for each unlabeled instance.

Similarly, if the regularization term for the category graph is incorporated into Eq. (3.4), we obtain another semi-supervised algorithm for multi-label learning, i.e. SMSE2:

(3.9)  min  \sum_{i=1}^{l} ||f_i − y_i||^2 + (β/2) \sum_{i,j=1}^{n} W_{ij} ||f_i / \sqrt{d_i} − f_j / \sqrt{d_j}||^2 + (γ/2) \sum_{i,j=1}^{k} W'_{ij} ||g_i / \sqrt{d'_i} − g_j / \sqrt{d'_j}||^2

where β and γ are nonnegative constants and d'_i = \sum_{j=1}^{k} W'_{ij}.

Next we will give the solutions of Eq. (3.8) and Eq. (3.9).

3.3 Solving the SMSE

3.3.1 SMSE1  First we have

E(f) = (1/2) \sum_{i,j=1}^{n} W_{ij} ||f_i − f_j||^2
     = (1/2) \sum_{i,j=1}^{n} W_{ij} ( f_i^T f_i + f_j^T f_j − 2 f_i^T f_j )
     = (1/2) ( \sum_{i=1}^{n} d_i f_i^T f_i + \sum_{j=1}^{n} d_j f_j^T f_j − 2 \sum_{i,j=1}^{n} W_{ij} f_i^T f_j )
     = trace(F^T (D − W) F)
(3.10)  = trace(F^T L F)

where d_i = \sum_{j=1}^{n} W_{ij}, D = diag(d_i) and L = D − W. Here L is called the combinatorial Laplacian, and it is obviously symmetric.

Similarly,

E'(g) = (1/2) \sum_{i,j=1}^{k} W'_{ij} ||g_i − g_j||^2
      = trace(F (D' − W') F^T)
(3.11)  = trace(F H F^T)

where D' = diag(d'_i), d'_i = \sum_{j=1}^{k} W'_{ij} and H = D' − W'. H is the combinatorial Laplacian of the category graph.

Therefore, Eq. (3.8) reduces to

(3.12)  min  µ trace(F^T L F) + ν trace(F H F^T)   s.t.  f_i = y_i (i = 1, · · · , l)

In order to solve the above optimization problem, let α = (α_1, · · · , α_l)^T be the l × k Lagrange multiplier matrix for the constraint f_i = y_i (i = 1, · · · , l). The Lagrangian Lag(F, α) becomes

(3.13)  Lag(F, α) = µ trace(F^T L F) + ν trace(F H F^T) + \sum_{i=1}^{l} α_i^T (f_i − y_i)

By applying the matrix identities ∂trace(X^T A X)/∂X = (A + A^T) X and ∂trace(X A X^T)/∂X = X (A + A^T), the Kuhn-Tucker condition ∂Lag(F, α)/∂F = 0 becomes

(3.14)  µ L F + ν F H + (1/2) [ α ; 0 ] = 0

We split the matrix L into four blocks after the lth row and column, L = [ L_{ll}  L_{lu} ; L_{ul}  L_{uu} ], and let F = [ F_l ; F_u ], where F_u denotes the soft labels of the unlabeled instances. Then the following equation can be derived from Eq. (3.14):

(3.15)  µ L_{ul} F_l + µ L_{uu} F_u + ν F_u H = 0

The above matrix equation is called a Sylvester Equation, which often occurs in the control domain. We first discuss the solutions of the Sylvester Equation

(3.16)  A X + X B = C

where A ∈ R^{m×m}, B ∈ R^{n×n} and X, C ∈ R^{m×n}.

Theorem 3.1. Eq. (3.16) has a solution if and only if the matrices

(3.17)  [ A  0 ; 0  −B ]   and   [ A  C ; 0  −B ]

are similar.

Theorem 3.2. When Eq. (3.16) is solvable, it has a unique solution if and only if the eigenvalues δ_1, · · · , δ_u of A and γ_1, · · · , γ_k of B satisfy δ_i + γ_j ≠ 0 (i = 1, · · · , u; j = 1, · · · , k).

Please see [18] for the proofs of Thm. 3.1 and 3.2.
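For moderate graph sizes, Eq. (3.15) can also be handed to a dense Sylvester solver instead of the iterative Krylov-subspace method of [13] adopted in this paper. The following sketch, which assumes SciPy's solve_sylvester and is our own illustration rather than the authors' implementation, shows the idea:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def smse1_soft_labels(L, H, Y_l, mu, nu):
    """Solve Eq. (3.15), mu*L_uu F_u + nu*F_u H = -mu*L_ul F_l, for the soft labels
    F_u of the unlabeled instances.  L is the n x n combinatorial Laplacian of the
    instance graph, H the k x k combinatorial Laplacian of the category graph, and
    Y_l the l x k label matrix of the labeled instances (so F_l = Y_l)."""
    l = Y_l.shape[0]
    L_ul, L_uu = L[l:, :l], L[l:, l:]
    A = mu * L_uu                      # u x u
    B = nu * H                         # k x k
    C = -mu * (L_ul @ Y_l)             # u x k right-hand side
    return solve_sylvester(A, B, C)    # solves A X + X B = C for X = F_u
```

Eq. (3.22) for SMSE2 in Section 3.3.2 has exactly the same A X + X B = C form, with A = βL_n + I, B = γH_n and C = Y, so the same call applies there.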

Eq. (3.15) has a unique solution F_u if it satisfies the above conditions, which are usually easily met in practical multi-label learning problems.

Here, an iterative Krylov-subspace method is adopted to solve the Sylvester Equation; please see [13] for details.

When ν = 0 and µ ≠ 0, Eq. (3.15) becomes

(3.18)  L_{ul} F_l + L_{uu} F_u = 0

This corresponds to solving the optimization problem in Eq. (3.3), so Zhu et al.'s semi-supervised learning approach [37] can be viewed as a special case of SMSE1.

3.3.2 SMSE2  First,

(1/2) \sum_{i,j=1}^{n} W_{ij} ||f_i / \sqrt{d_i} − f_j / \sqrt{d_j}||^2
   = (1/2) \sum_{i,j=1}^{n} W_{ij} ( f_i^T f_i / d_i + f_j^T f_j / d_j − 2 f_i^T f_j / \sqrt{d_i d_j} )
   = (1/2) ( \sum_{i=1}^{n} f_i^T f_i + \sum_{j=1}^{n} f_j^T f_j − 2 \sum_{i,j=1}^{n} W_{ij} f_i^T f_j / \sqrt{d_i d_j} )
   = trace(F^T (I − D^{−1/2} W D^{−1/2}) F)
(3.19)  = trace(F^T L_n F)

where L_n = I − D^{−1/2} W D^{−1/2}. Here, L_n is called the normalized Laplacian and is also symmetric.

Similarly,

(1/2) \sum_{i,j=1}^{k} W'_{ij} ||g_i / \sqrt{d'_i} − g_j / \sqrt{d'_j}||^2
   = trace(F (I − D'^{−1/2} W' D'^{−1/2}) F^T)
(3.20)  = trace(F H_n F^T)

where H_n = I − D'^{−1/2} W' D'^{−1/2}. H_n is the normalized Laplacian of the category graph.

Thus, Eq. (3.9) is converted into

(3.21)  min  \sum_{i=1}^{n} ||f_i − y_i||^2 + β trace(F^T L_n F) + γ trace(F H_n F^T)

By applying an optimization method similar to that used for SMSE1, Eq. (3.21) reduces to

(3.22)  (β L_n + I) F + γ F H_n − Y = 0

Obviously, the above matrix equation is also a Sylvester Equation. Compared with solving SMSE1, solving SMSE2 does not involve block matrices, but it requires more computation, since the number of variables increases from u × k to n × k. However, SMSE1 cannot be applied in some cases where no natural block structure exists, while SMSE2 can. In Section 6, we will give such an application.

4 Connections to Spectral Clustering

Zhu et al. [37] discussed the relations between their graph-based semi-supervised learning algorithm and spectral clustering. Spectral clustering is unsupervised: there is no labeled information, and it only depends on the graph weights W. Graph-based semi-supervised learning algorithms, on the other hand, maintain a balance between how good the clustering is and how well the labeled data can be explained by it [36].

A typical spectral clustering approach, the normalized cut [27], seeks to minimize

(4.23)  min  y^T (D − W) y / (y^T D y)   s.t.  y^T D 1 = 0

The solution y is the second smallest eigenvector of the generalized eigenvalue problem L y = λ D y. Then y is discretized to obtain the clusters. In fact, if we add the labeled data information into Eq. (4.23) and simultaneously discard the scale constraint term y^T D y, Zhu et al.'s semi-supervised learning algorithm [37] is immediately obtained.

Therefore, if there is no supervised label information and the graph weights W and W' of both the instance graph and the category graph can be calculated in some way, our algorithm SMSE reduces to simultaneous clustering (also called co-clustering) on two different graphs. For example, if we apply the combinatorial Laplacian to do co-clustering, the corresponding co-clustering algorithm can be formalized as follows:

(4.24)  min  trace(F^T L F) / trace(F^T D F) + τ trace(F H F^T) / trace(F D' F^T)   s.t.  f_i^T D 1 = 0,  g_i^T D' 1 = 0

where τ is a nonnegative constant. The solution F of the above problem is then further clustered along its rows and columns respectively. Thus, the clusters of both categories and instances can be obtained. However, research on co-clustering is beyond the scope of this paper, and here we only concentrate on semi-supervised learning.

5 Experiments

5.1 Data Set and Experimental Setup  Our data set is a subset of the RCV1-v2 text data, provided by
Reuters and corrected by Lewis et al. [19]. The data set includes the information of topics, regions and industries for each document and a hierarchical structure for topics and industries. Here, we use topics as the classification tasks and simply ignore the topic hierarchy. We first randomly pick 3000 documents, then choose words with more than 5 occurrences and topics with more than 40 positive assignments. Finally, we have 3000 documents with 4082 words and 60 topics left. On average, each topic contains 225 positive documents, and each document is assigned to 4.5 categories. In order to reduce computational expense, we create kNN graphs rather than fully connected graphs, which means that nodes i and j are connected by an edge if i is in j's k-nearest-neighborhood or vice versa. Computation on such sparse graphs is fast. In general, the neighborhood sizes and the other parameters in Eq. (3.8) and Eq. (3.9) can be obtained by cross validation on the training set. In the following experiments, the neighborhood sizes for the instance graph and the category graph are 17 and 8 respectively.

5.2 Evaluation Metrics  Since our approach only produces a ranked list of category labels for each test instance, in this paper we focus on evaluating the quality of the category ranking. More concretely, we evaluate the performance when varying the number of predicted labels for each test instance along the ranked list of class labels. Following [16, 32], we choose the F1 Micro measure as the evaluation metric, which can be seen as a weighted average of the F1 scores over all the categories (see [32] for details). The F1 measure of the sth category is defined as follows:

(5.25)  F_1(s) = 2 p_s r_s / (p_s + r_s)

where p_s and r_s are the precision and recall of the sth category, respectively. They can be calculated using the following equations:

(5.26)  p_s = |{ x_i | s ∈ C_i ∧ s ∈ Ĉ_i }| / |{ x_i | s ∈ Ĉ_i }|

(5.27)  r_s = |{ x_i | s ∈ C_i ∧ s ∈ Ĉ_i }| / |{ x_i | s ∈ C_i }|

where C_i and Ĉ_i are the true labels and the predicted labels of the ith instance x_i, respectively.

5.3 The Influence of Parameters  We analyze the influence of the parameters in SMSE. We randomly choose 500 of the 3000 documents as labeled data and use the remaining 2500 documents as unlabeled data. The number of predicted labels for each test document is set to 10. We set the hyperparameters σ_1^2 = σ_2^2 = · · · = σ_m^2 = 0.27 and λ = 10. With respect to the above configuration, Fig. 1 shows the F1 Micro scores of SMSE1 when varying the value of ν/µ. When ν = 0 and µ ≠ 0, SMSE1 reduces to Eq. (3.3), which only constructs an instance graph and does not consider the correlations among different topics. Conversely, when ν ≠ 0 and µ = 0, only a category graph is used in SMSE1. From Fig. 1, we see that by choosing an appropriate value of ν/µ our approach indeed makes use of the correlation information among different categories and clearly increases the performance compared with an algorithm that builds only an instance graph or only a category graph. In practice, only one parameter needs to be tuned in SMSE1, with the other fixed to 1.

[Figure 1: Performance of SMSE1 with respect to ν/µ (F1 Micro vs. ν/µ; number of training instances = 500).]

Fig. 2 shows the F1 Micro scores of SMSE2 when varying the values of β and γ. Similarly, by choosing appropriate values of the two parameters, we can achieve the best predictions. However, in comparison with SMSE1, SMSE2 has two parameters to tune rather than one.

[Figure 2: Performance of SMSE2 with respect to β and γ (F1 Micro over a grid of β and γ; number of training instances = 500).]
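To make the evaluation protocol of Sections 5.2-5.3 concrete, the following sketch turns soft label scores into top-r predictions and computes a micro-averaged F1 by pooling counts over all categories (one common reading of Eqs. (5.25)-(5.27), following [32]). It is our own illustration, not the authors' evaluation code:

```python
import numpy as np

def top_r_predictions(F, r):
    """Turn soft label scores F (n_test x k) into 0/1 predictions by keeping
    the r highest-ranked categories of each test instance."""
    pred = np.zeros_like(F, dtype=bool)
    top = np.argsort(-F, axis=1)[:, :r]
    rows = np.arange(F.shape[0])[:, None]
    pred[rows, top] = True
    return pred

def f1_micro(Y_true, Y_pred):
    """Micro-averaged F1: precision and recall computed from true-positive,
    predicted-positive and actual-positive counts pooled over all categories."""
    tp = np.logical_and(Y_true, Y_pred).sum()
    p = tp / max(Y_pred.sum(), 1)
    r = tp / max(Y_true.sum(), 1)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```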

5.4 Comparisons and Discussions  We compare our algorithm with three baseline models. The first one is a semi-supervised multi-label learning method based on Constrained Non-negative Matrix Factorization (CNMF) [21]. The key assumption behind CNMF is that two instances tend to have a large overlap in their assigned category memberships if they share high similarity in their input patterns. CNMF evaluates the instance similarity matrix for the instance graph from two different viewpoints: one is based on the correlations between the input patterns of the two instances, the other is based on the overlap between the category labels of the two instances. By minimizing the difference between the two similarity matrices, CNMF can determine the labels of unlabeled data. The second model is the Support Vector Machine (SVM); a linear SVM classifier is built for each category independently. The last baseline model is Multi-label Informed Latent Semantic Indexing (MLSI) [33], which first maps the input features into a new feature space that retains the information of the original inputs and meanwhile captures the dependency of the output labels, and then trains a set of linear SVMs on this projected space. Fig. 3 shows the performance of all five algorithms (SMSE1, SMSE2, CNMF, SVM and MLSI) at different ranks when the number of training data is 500 or 2000. All the methods are tested in a 10-fold experiment using the same training/test splits of the data set, and the average F1 Micro score of each method is computed. It should also be noted that all parameters contained in the five methods are chosen by grid search. From Fig. 3, we can observe the following:

1. SMSE1, SMSE2 and CNMF achieve similar performance in F1 Micro, and all of them are superior to SVM and MLSI if we choose a proper number of predicted labels for each test instance. However, in comparison with SMSE1 and SMSE2, CNMF has more variables and more complicated formulae to be calculated. The average execution time of SMSE1, SMSE2 and CNMF on a PC with a 2.4GHz CPU and 1GB RAM using Matlab code is 83.2s, 187.3s and 423.9s respectively when the number of labeled data is 500, and 40.1s, 199.4s and 353.3s respectively when the number of labeled data is 2000. This clearly demonstrates SMSE's advantage in computational expense. As discussed in Section 3.3.2, SMSE2 has more variables to solve for than SMSE1, so its execution time is longer than that of SMSE1.

2. A larger performance improvement by SMSE1, SMSE2 and CNMF is observed when the number of training data is 500 than when it is 2000. This is because in semi-supervised learning the benefit provided by unlabeled data is expected to decrease with more labeled data, which has been verified in many studies such as [26, 36, 37].

[Figure 3: Performance when varying the number of predicted labels for each test instance along the ranked list of category labels ((a) 500 training instances; (b) 2000 training instances).]

To sum up, when the amount of labeled data is relatively small, especially when labeled instances are difficult, expensive, or time-consuming to obtain, semi-supervised algorithms are generally a better choice than supervised ones. Here, considering the balance between F1 Micro and computational efficiency, the overall performance of SMSE1 is arguably the best among the five approaches for multi-label learning.

6 The Extended Application on Collaborative Filtering

6.1 Introduction to Collaborative Filtering  Collaborative filtering aims at predicting a test user's ratings for new items based on a collection of other like-minded users' ratings. The key assumption

is that users sharing the same ratings on past items tend to agree on new items. Various collaborative filtering techniques have been successfully utilized to build recommender systems (e.g. for movies [1] and books [20]).

[Figure 4: A user-item matrix. “?” means the item is not rated by the corresponding user.]

In a typical collaborative filtering scenario, there is a p × n user-item matrix X (see Fig. 4), where p is the number of users and n is the number of items. Each element of X, x_{jm} = r, denotes that the jth user rates the mth item with r, where r ∈ {1, · · · , R}. When the item is not rated, x_{jm} = ∅. The goal is usually to predict the ratings of the unrated items.

Let

(6.28)  X = [u_1, · · · , u_p]^T,  u_j = (x_{j1}, · · · , x_{jn})^T,  j ∈ {1, · · · , p}

where the vector u_j contains the jth user's ratings for all items.

Likewise, the user-item matrix X can be decomposed into column vectors:

(6.29)  X = [i_1, · · · , i_n],  i_m = (x_{1m}, · · · , x_{pm})^T,  m ∈ {1, · · · , n}

where the vector i_m contains all users' ratings for the mth item.

Collaborative filtering approaches can be mainly divided into two categories: user-based [4] and item-based [24]. User-based algorithms for collaborative filtering aim at predicting a test user's ratings for unknown items by synthesizing the information of like-minded users. They first compute the similarities between the test user and the other users, then select the K most similar users to the test user by ranking the similarities. Finally, the unknown rating is predicted by combining the known ratings of the K neighbors. Item-based algorithms for collaborative filtering are similar to user-based algorithms, except that they compute the pairwise similarity between items. In item-based approaches, the similarities between the test item and the other items are first calculated and sorted, so that the K most similar items to the test item are obtained. Then the unknown rating is again predicted by combining the known ratings of the K neighbors.

6.2 Applying SMSE2 to Collaborative Filtering  In fact, collaborative filtering is quite analogous to multi-label learning. If we consider collaborative filtering from a graph viewpoint, we can construct a user graph and an item graph respectively. The graph weights can be obtained by computing the similarities between pairwise user or item vectors (here, we utilize Eq. (3.5) to calculate the graph weights). The regularization term for the user graph measures the smoothness of the user vectors, and the regularization term for the item graph measures the smoothness of the item vectors. Obviously, by combining the two regularization terms for the user and item graphs, the unknown ratings can be obtained by solving the SMSE. It should be noted that since the user-item matrix does not have natural blocks (see Fig. 4), only SMSE2 can be used for collaborative filtering, while SMSE1 cannot. To some extent, SMSE2 can be seen as a hybrid method for collaborative filtering that convexly combines the user-based and item-based approaches.
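A rough sketch of this pipeline is given below. It is our own illustration under the assumptions just stated (cosine-based weights in the form of Eq. (3.5) on the rows and columns of the rating matrix, and the SMSE2 Sylvester equation of Eq. (3.22) with the observed rating matrix playing the role of Y); the helper names are hypothetical and SciPy's solve_sylvester is assumed to be available:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def smse2_collaborative_filtering(R, beta, gamma, lam=10.0):
    """Predict scores for all entries of a p x n user-item matrix R (0 = unrated),
    in the spirit of Section 6.2: build a user graph and an item graph with the
    cosine-based weights of Eq. (3.5), then solve the SMSE2 Sylvester equation
    (beta*Ln + I) F + gamma * F Hn = R for the full score matrix F."""
    def cosine_weights(V):                       # rows of V are the vectors to compare
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        cos = (V @ V.T) / (norms * norms.T)
        return np.exp(-lam * (1.0 - cos))

    def normalized_laplacian(W):                 # I - D^{-1/2} W D^{-1/2}
        d = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        return np.eye(len(d)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

    Ln = normalized_laplacian(cosine_weights(R))      # user graph (p x p)
    Hn = normalized_laplacian(cosine_weights(R.T))    # item graph (n x n)
    A = beta * Ln + np.eye(R.shape[0])
    return solve_sylvester(A, gamma * Hn, R)          # scores for all user-item pairs
```

Treating unrated entries as zeros on the right-hand side mirrors the convention y_i = 0 for unlabeled instances in Eq. (3.4); the resulting scores are then used only to rank the unrated items, as in Section 6.3.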

6.3 Preliminary Experiments  We used the MovieLens data set (http://www.grouplens.org/) to evaluate our algorithm. The MovieLens data set is composed of 943 users and 1682 items with ratings on a 1-5 scale, where each user has more than 20 ratings. Here, we extracted a subset which contains 500 users with more than 40 ratings each and 1000 items. The first 300 users in the data set are selected as the training set and the remaining 200 users as the test set. In our experiments, the available ratings of each test user are split half-and-half into an observed set and a held-out set. The observed ratings are used to predict the held-out ratings.

Here, we are only concerned with ranking the unrated data and recommending the top items to the active user. Therefore, following [31], we choose Order Consistency (OC) to measure how similar the predicted order is to the true order. Assume there are n items; let v be the vector of these n items sorted in decreasing order of their predicted ranking scores, and let v' be the vector of these n items sorted in decreasing order of their true ratings. For these n items, there are C_n^2 = n(n − 1)/2 ways to select a pair of distinct items. Let A be the set of item pairs whose relative order in v is the same as in v'. Then Order Consistency is defined as

(6.30)  OC = |A| / C_n^2

The larger the value of OC, the better the predictions are.

Recently, Wang et al. [31] proposed a novel item-based recommendation scheme called Item Rating Smoothness Maximization (IRSM). In their framework, the items are first described by an undirected weighted graph, and then, based on Zhou et al.'s method [34], the unknown ratings can be predicted. Their theoretical analysis and experimental results show the effectiveness of IRSM on recommendation problems. It is easy to see that IRSM is a special case of SMSE2 in which the user graph is not utilized. Similarly, if we only construct a user graph to predict the unknown ratings, another method, which we call User Rating Smoothness Maximization (URSM), is obtained. It is clear that URSM is also a special case of SMSE2. Here, we compare SMSE2 with four approaches: IRSM, URSM, the traditional user-based method (UB) [4] and the item-based method (IB) [24]. Tab. 1 shows the OC values of the five algorithms. Note that all parameters are determined by grid search. It can be observed that SMSE2 is superior to the other four approaches, which validates the effectiveness of SMSE2 on collaborative filtering.

Table 1: The OC values of SMSE2, IRSM, URSM, IB and UB. A larger value means a better performance.

Algorithm   SMSE2   IRSM    URSM    IB      UB
OC          0.820   0.785   0.782   0.719   0.711

7 Conclusions

In this paper we propose a novel semi-supervised algorithm for multi-label learning by solving a Sylvester Equation. Two graphs are first constructed on the instance level and the category level respectively. By combining the regularization terms for the two graphs, a regularization framework for multi-label learning is suggested. The labels of unlabeled instances can be obtained by solving a Sylvester Equation. Our method can exploit unlabeled data information as well as the correlations among categories. Empirical studies show that our algorithm is quite competitive against state-of-the-art multi-label learning techniques. Additionally, we successfully applied our algorithm to collaborative filtering.

In the future, we will further study SMSE2's overall performance on collaborative filtering and develop more effective multi-label learning approaches.

References

[1] http://movielens.umn.edu.
[2] M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. In Proc. of AISTATS, 2005.
[3] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37:1757–1771, 2004.
[4] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of UAI, 1998.
[5] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proc. of CIKM, 2004.
[6] E. Chang, K. Goh, G. Sychay, and G. Wu. Content-based soft annotation for multimodal image retrieval using bayes point machines. IEEE Trans. on Circuits and Systems for Video Tech. Special Issue on Conceptual and Dynamical Aspects of Multimedia Content Description, 13(1), 2003.
[7] K. Crammer and Y. Singer. A new family of online algorithms for category ranking. In Proc. of SIGIR, 2002.
[8] O. Dekel, C. D. Manning, and Y. Singer. Log-linear models for label ranking. In Proc. of NIPS, 2003.
[9] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Proc. of NIPS, 2001.
[10] S. Gao, W. Wu, C. H. Lee, and T. S. Chua. A MFoM learning approach to robust multiclass multi-label text categorization. In Proc. of ICML, 2004.
[11] N. Ghamrawi and A. McCallum. Collective multi-label classification. In Proc. of CIKM, 2005.
[12] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Proc. of NIPS, 2005.
[13] D. Y. Hu and L. Reichel. Krylov-subspace methods for the Sylvester equation. Linear Algebra and Its Applications, (172):283–313, 1992.
[14] R. Jin and Z. Ghahramani. Learning with multiple labels. In Proc. of NIPS, 2003.
[15] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. of ECML, 1998.
[16] F. Kang, R. Jin, and R. Sukthankar. Correlated label propagation with application to multi-label learning. In Proc. of CVPR, 2006.
[17] H. Kazawa, T. Izumitani, H. Taira, and E. Maeda. Maximal margin labeling for multi-topic text categorization. In Proc. of NIPS, 2005.
[18] P. Lancaster and M. Tismenetsky. The Theory of Matrices: With Applications. Academic Press, 1985.
[19] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[20] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filter-

ing. IEEE Internet Computing, pages 76–80, January-February 2003.
[21] Y. Liu, R. Jin, and L. Yang. Semi-supervised multi-label learning by constrained non-negative matrix factorization. In Proc. of AAAI, 2006.
[22] A. McCallum. Multi-label text classification with a mixture model trained by EM. In Proc. of AAAI Workshop on Text Learning, 1999.
[23] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. On maximum margin hierarchical multi-label classification. In Proc. of NIPS Workshop on Learning With Structured Outputs, 2004.
[24] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of WWW, 2001.
[25] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3), 2000.
[26] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.
[27] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.
[28] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In Proc. of ICML, 2004.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proc. of ICML, 2004.
[30] N. Ueda and K. Saito. Parametric mixture models for multi-labelled text. In Proc. of NIPS, 2002.
[31] F. Wang, S. Ma, L. Yang, and T. Li. Recommendation on item graphs. In Proc. of ICDM, 2006.
[32] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1/2), 1999.
[33] K. Yu, S. Yu, and V. Tresp. Multi-label informed latent semantic indexing. In Proc. of SIGIR, 2005.
[34] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. In Proc. of NIPS, 2003.
[35] S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labelled classification using maximum entropy method. In Proc. of SIGIR, 2005.
[36] X. Zhu. Semi-supervised learning literature survey. Technical Report TR 1530, University of Wisconsin-Madison, 2006.
[37] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian random fields and harmonic functions. In Proc. of ICML, 2003.
