A Hierarchical Attention Model For Social Contextual Image Recommendation
Abstract—Image based social networks are among the most popular social networking services in recent years. With tremendous
images uploaded every day, understanding users' preferences for user-generated images and making recommendations have become
an urgent need. In fact, many hybrid models have been proposed to fuse various kinds of side information (e.g., image visual
representation, social network) and user-item historical behavior for enhancing recommendation performance. However, due to the
unique characteristics of user generated images in social image platforms, previous studies have failed to capture the complex
aspects that influence users' preferences in a unified framework. Moreover, most of these hybrid models relied on predefined weights
for combining different kinds of information, which usually resulted in sub-optimal recommendation performance. To this end, in this
paper, we develop a hierarchical attention model for social contextual image recommendation. In addition to basic latent user interest
modeling in the popular matrix factorization based recommendation, we identify three key aspects (i.e., upload history, social influence,
and owner admiration) that affect each user's latent preferences, where each aspect summarizes a contextual factor from the complex
relationships between users and images. After that, we design a hierarchical attention network that naturally mirrors the hierarchical
relationship (the elements within each aspect, and the aspect level) of users' latent interests with the identified key aspects. Specifically,
by taking embeddings from state-of-the-art deep learning models that are tailored for each kind of data, the hierarchical attention
network could learn to attend differently to more or less informative content. Finally, extensive experimental results on real-world datasets clearly
show the superiority of our proposed model.
1 INTRODUCTION
There is an old saying that "a picture is worth a thousand words". When it comes to social media, visual images have proven increasingly popular for attracting users [14]. Especially with the increasing adoption of smartphones, users can easily take high-quality images and upload them to various social image platforms to share these visually appealing pictures with others. Many image-based social sharing services have emerged, such as Instagram (https://www.instagram.com), Pinterest (https://www.pinterest.com), and Flickr (https://www.flickr.com). With hundreds of millions of images uploaded every day, image recommendation has become an urgent need to deal with the image overload problem. By providing personalized image suggestions to each active user, image recommender systems increase user satisfaction and foster platform prosperity. E.g., as reported by Pinterest, image recommendation powers over 40% of the user engagement of this social platform [30].

Naturally, standard recommendation algorithms provide a direct solution for the image recommendation task [2]. For example, many classical latent factor based Collaborative Filtering (CF) algorithms in recommender systems could be applied to the user-image interaction matrix [26], [40]. Successful as they are, the extreme data sparsity of the user-image interaction behavior limits the recommendation performance [2], [26]. On one hand, some recent works proposed to enhance recommendation performance with visual contents learned from a (pre-trained) deep neural network [18], [49], [5]. On the other hand, as users express image preferences in social platforms, some social based recommendation algorithms utilized the social influence among users to alleviate data sparsity for better recommendation [33], [24], [3]. In summary, these studies partially solved the data sparsity issue of social-based image recommendation. Nevertheless, the problem of how to better exploit the unique characteristics of social image platforms in a holistic way to enhance recommendation performance is still underexplored.

In this paper, we study the problem of understanding users' preferences for images and recommending images in social image based platforms. Fig. 1 shows an example of a typical social image application. Each image is associated with visual information. Besides expressing likeness for images, users are also creators of these images through their upload behavior. In addition, users connect with others to form a social network and share their image preferences. This rich heterogeneous contextual data provides valuable clues to infer users' preferences for images. Given such data, the problem of how to summarize the heterogeneous social contextual aspects that influence users' preferences for this highly subjective content is still unclear. What's more, in the preference decision process, different users care about different social contextual aspects for their personalized image preferences. E.g., Lily likes images that are similar to her uploaded images, while Bob is easily swayed by social neighbors to present preferences similar to those of his social friends. In other words, the unique way each user balances these complex social contextual aspects makes the recommendation problem more challenging.

• L. Wu, L. Chen, R. Hong, and M. Wang are with the School of Computer and Information, Hefei University of Technology, Hefei, Anhui 230009, China. Emails: {lewu.ustc, chenlei182979, hongrc.hfut, eric.mengwang}@gmail.com.
• Y. Fu is with the Department of Computer Science, University of Missouri-Rolla, Rolla, MO, USA. Email: [email protected].
• X. Xie is with Microsoft Research, Beijing, China. Email: [email protected].
Fig. 1. An overall framework of social contextual image recommendation, where the left part shows the data characteristics of the
platform, and the right part shows our proposed model.
To address the challenges mentioned above, in this paper, we design a hierarchical attention model for social image recommendation. The proposed model is built on the popular latent factor based models, which assume users and items could be projected into a low-dimensional latent space [34]. In our proposed model, for each user, in addition to a basic latent user interest vector, we identify three key aspects (i.e., upload history, social influence and owner admiration) that affect each user's preference, where each aspect summarizes a contextual factor from the complex relationships between users and images. Specifically, the upload history aspect summarizes each user's uploaded images to characterize her interest. The social influence aspect characterizes the influence from the social network structure, and the owner admiration aspect depicts the influence from the uploader of the recommended image. The three key aspects are combined to form the auxiliary user latent embedding. Furthermore, since not all aspects are equally important for personalized image recommendation, we design a hierarchical attention structure that attentively weights the different aspects for each user's auxiliary embedding. The proposed hierarchical structure aims at capturing the following two distinctive characteristics. First, as social contextual recommendation naturally exhibits a hierarchical structure (various elements from each aspect, and the three aspects of each user), we likewise construct the user interest representation with a hierarchical structure. In this hierarchical structure, we first build auxiliary aspect representations of each user, and then aggregate the three aspect representations into an auxiliary user interest vector. Second, as different elements within each aspect, and different aspects, are differentially informative for each user in the recommendation process, the hierarchical attention network builds two levels of attention mechanisms that apply at the element level and the aspect level.

We summarize the contributions of this paper as follows:

1) We study the problem of image recommendation in social image based platforms. By considering the uniqueness of these platforms, we identify three social contextual aspects that affect users' preferences from heterogeneous data sources.
2) We design a hierarchical attention network to model the hierarchical structure of social contextual recommendation. We feed embeddings from state-of-the-art deep learning models that are tailored for each kind of data into the attention networks. Thus, the attention networks could learn to attend differently based on the rich contextual information for user interest modeling.
3) We conduct extensive experiments on real-world datasets. The experimental results clearly show the effectiveness of our proposed model.

2 RELATED WORK

We summarize the related work in the following four categories.

General Recommendation. Recommender systems could be classified into three categories: content based methods, Collaborative Filtering (CF), and hybrid models [2]. Among all models for building recommender systems, latent factor based models from the CF category are among the most popular techniques due to their relatively high performance in practice [40], [34], [39]. These latent factor based models decompose both users and items into a low-dimensional latent space, and the preference of a user for an item could be approximated as the inner product between the corresponding user and item latent vectors. In real-world applications, instead of explicit ratings, users usually implicitly express their opinions through action or inaction. Bayesian Personalized Ranking (BPR) is a popular latent factor based model that deals with such implicit feedback [40]. Specifically, BPR optimizes a pairwise ranking loss, such that observed implicit feedbacks are preferred to rank higher than unobserved ones. Moreover, users may simultaneously express their opinions through several kinds of feedbacks (e.g., click behavior, consumption behavior). SVD++ is proposed to incorporate users' different feedbacks by extending the classical latent factor based models, assuming each user's latent factor is composed of a base latent factor and an auxiliary latent factor that can be derived from other kinds of feedbacks [26]. Due to the performance improvement and extensibility of SVD++, it has been widely studied to incorporate different kinds of information, e.g., item text [58] and multi-class preference of users [36].
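For reference, the pairwise BPR objective and the SVD++-style prediction discussed above can be sketched as follows; these are the standard formulations written in generic notation, not the exact equations of [40], [26]:

\min_{\Theta} \sum_{(a,i,j):\, r_{ai}=1,\ r_{aj}=0} -\ln \sigma\big(\hat{r}_{ai} - \hat{r}_{aj}\big) + \lambda \|\Theta\|^2, \qquad \hat{r}_{ai} = p_a^T w_i \quad \text{(BPR)}

\hat{r}_{ai} = w_i^T \Big( p_a + |N(a)|^{-\frac{1}{2}} \sum_{j \in N(a)} y_j \Big) \quad \text{(SVD++-style, biases omitted)}

Here N(a) denotes the set of items on which user a gives auxiliary feedback and y_j are auxiliary item factors. The model proposed in Section 4 follows this base-plus-auxiliary structure, but learns the combination weights with attention networks instead of fixing them in advance.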
Image Recommendation. In many image based social networks, images are associated with rich context information, e.g., the text in the image and the hashtags. Researchers proposed to apply factorization machines for image recommendation by considering this rich context information [6]. Recently, deep Convolutional Neural Networks (CNNs) have been successfully applied to analyzing visual imagery through automatic image representation learning [27]. Thus, it is a natural idea to leverage the visual features of CNNs to enhance image recommendation performance [18], [28], [17], [5]. E.g., VBPR is an extension of BPR for image recommendation, on top of which it learns an additional visual dimension from a CNN that models users' visual preferences [18]. There are some other image recommendation models that tackle the temporal dynamics of users' preferences for images over time [17], or users' location preferences for image recommendation [35], [49]. As well studied in the computer vision community, in parallel to the visual content information from deep CNNs, images convey rich style information. Researchers showed that many brands post images that show the philosophy and lifestyle of a brand [14], and images posted by users also reflect users' personality [13]. Recently, Gatys et al. proposed a new model for extracting image styles based on the feature maps of convolutional neural networks [10]. The proposed model showed high perceptual quality for extracting image style, and has been successfully applied to related tasks, such as image style transfer [11] and high-resolution image stylisation [12]. We argue that the visual image style also plays a vital role in shaping users' visual experience in recommender systems. Thus, we leverage both the image content and the image style for recommendation.

Social Contextual Recommendation. Social scientists have long converged on the view that a user's preference is similar to or influenced by her social connections, as explained by the social theories of homophily and social influence [3]. With the prevalence of social networks, a popular research direction is to leverage social data to improve recommendation performance [33], [23], [24], [51]. E.g., Ma et al. proposed a latent factor based model with social regularization terms for recommendation [33]. Since most of these social recommendation tasks are formulated as non-convex optimization problems, researchers have designed an unsupervised deep learning model to initialize model parameters for better performance [9]. Besides, ContextMF is proposed to fuse individual preference and interpersonal influence with auxiliary text content information from social networks [24]. As the implicit influence of trusts and ratings is valuable for recommendation, TrustSVD is proposed to incorporate the influence of trusted users on the prediction of items for an active user [16]; the proposed technique extends SVD++ with social trust information. Social recommendation has also been considered with social circles [38], online social recommendation [59], social network evolution [50], and so on.

Besides, as the social network could be seen as a graph, the recent surge of network embedding is also closely related to our work [8]. Network embedding models encode the graph structural information into a low latent space, such that each node is represented as an embedding in this latent space. Many network embedding models have been proposed [37], [44], [48], [47]. The learned network embeddings could be used as input to the attention networks. We distinguish our work from these studies, as the focus of this paper is not to advance sophisticated network embedding models; we put emphasis on how to enhance recommendation performance by leveraging various data embeddings.

Attention Mechanism. Neural science studies have shown that people focus on specific parts of the input rather than using all available information [22]. The attention mechanism is an intuitive idea that automatically models and selects the most pertinent pieces of information: it learns to assign attentive weights to a set of inputs, with higher (lower) weights indicating that the corresponding inputs are more (less) informative for generating the output. The attention mechanism is widely used in many neural network based tasks, such as machine translation [4] and image captioning [53]. Recently, the attention mechanism has also been widely used in recommender systems [19], [52], [43], [41]. Given the classical collaborative filtering scenario with user-item interaction behavior, NAIS extended the classical item based recommendation models by distinguishing the importance of different historical items in a user profile [19]. With users' temporal behavior, attention networks were proposed to learn which historical behavior is more important for the user's current temporal decision [31], [32]. Many attention based recommendation models have been developed to better exploit auxiliary information to improve recommendation performance. E.g., ANSR is proposed with a social attention module to learn adaptive social influence strength for social recommendation [43]. Given the review or the text of an item, attention networks were developed to learn informative sentences or words for recommendation [15], [41]. While the above models perform the standard vanilla attention to learn to attend to a specific piece of information, the co-attention mechanism is concerned with learning attention weights from two sequences [21], [56], [46]. E.g., in hashtag recommendation with both text and image information, a co-attention network is designed to learn which part of the text is distinctive for images, and simultaneously the important visual features for the text [56]. Besides, researchers have made a comprehensive survey of attention based recommendation models [57]. In some real-world applications, there exists a hierarchical structure among the data, and several pioneering works have been proposed to deal with this kind of relationship [54], [29]. E.g., a hierarchical attention model is proposed to model the hierarchical relationships of words, sentences and documents for document classification [54]. Our work borrows ideas from the attention mechanism, and we extend this idea by designing a hierarchical structure to model the complex social contextual aspects that influence users' preferences. Nevertheless, different from the natural hierarchical structure of words, sentences and documents in natural language processing, the hierarchical structure that influences a user's decision from complex heterogeneous data sources is summarized by our proposed model. Specifically, our proposed model has a two-layered hierarchical structure, with bottom layer attention networks that summarize each aspect from the elements within it, and a top layer attention network that weights the three aspects for each user.
The image style representation is built on the powerful feature spaces learned by convolutional neural networks, with the assumption that styles are agnostic to the spatial information in the image hidden representations. With the trained VGG19 architecture, suppose a layer l has N_l distinct filter feature maps, each of which is vectorized into a size of M_l. Let B^l ∈ R^{N_l × M_l} denote the filter responses at layer l, where b^l_{jk} is the activation of the j-th filter at position k. A summary Gram statistic is proposed to discard the spatial information in the feature maps by computing their correlations as:

g^l_{ij} = \sum_{k} b^l_{ik} b^l_{jk},    (1)

where G^l ∈ R^{N_l × N_l} is the Gram matrix, with g^l_{ij} denoting the correlation between feature maps i and j in layer l. Naturally, the set of Gram matrices G^1, G^2, ..., G^L from different layers of VGG19 provides a description of the image style. In practice, researchers found that the style representations on layers 'conv1_1', 'conv2_1', 'conv3_1', 'conv4_1' and 'conv5_1' can well represent the textures of an image [11], [12]. As the sizes of these Gram matrices are very large, we downsample each Gram matrix into a fixed size of 32 × 32, and then concatenate the vector representations of the downsampled Gram matrices of the five layers. Since there are 5 Gram matrices and each vectorized Gram matrix has 1024 dimensions, the style representation f_i^s of image i has 5120 dimensions.
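To make the construction of f_i^s concrete, the following is a minimal sketch of the style feature computed from pre-extracted VGG19 feature maps; the array shapes and the block-average downsampling are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def gram_matrix(feature_map):
    # feature_map: (N_l, H, W) activations of one VGG19 layer for one image
    n_l = feature_map.shape[0]
    b = feature_map.reshape(n_l, -1)          # vectorize: (N_l, M_l)
    return b @ b.T                            # Eq. (1): (N_l, N_l)

def downsample_32(gram):
    # block-average the Gram matrix to a fixed 32 x 32 summary
    n = gram.shape[0]
    assert n % 32 == 0, "VGG19 conv*_1 layers have 64/128/256/512 filters"
    f = n // 32
    return gram.reshape(32, f, 32, f).mean(axis=(1, 3))

def style_representation(feature_maps):
    # feature_maps: list of 5 arrays, one per layer conv1_1 ... conv5_1
    parts = [downsample_32(gram_matrix(fm)).ravel() for fm in feature_maps]
    return np.concatenate(parts)              # 5 x 1024 = 5120-dim f_i^s

# Toy usage with VGG19-like channel counts and a small spatial size.
fmaps = [np.random.rand(c, 16, 16) for c in (64, 128, 256, 512, 512)]
f_s = style_representation(fmaps)             # shape (5120,)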
3.2 Problem Definition

Given the social matrix S and upload matrix L, we identify three key social contextual aspects, i.e., social influence, upload history, and creator admiration, that may influence users' preferences. Specifically, the social influence aspect from each user a's social network structure s_a is well recognized as an important factor in the recommendation process [33], [24]. Social influence states that each active user is influenced by her social connections, leading to similar preferences between social connections [3]. Besides, for each user-item pair (a, i), we could obtain the upload history list l_a of user a, and the creator C_i of image i, from the upload matrix L. Based on this observation, we design the two remaining contextual aspects in users' preference decision process: an upload history aspect that explains the consistency between a user's upload history l_a and her preference for images, and a creator admiration aspect that reflects the admiration for the creator C_i. These three contextual aspects characterize each user's implicit feedback to images from various contextual situations in the heterogeneous social image data. Now, we define the social contextual image recommendation problem as:

Definition 1. [PROBLEM DEFINITION] Given the user rating matrix R, the upload matrix L, and the social network S in a social image platform, with the social embedding e_a of each user a, and the content representation f_i^c and style representation f_i^s of each image i, the social contextual recommendation task aims at predicting each user a's unknown preference for image i with the three social contextual aspects (s_a, l_a, C_i) and the heterogeneous data embeddings (e_a, f_i^c and f_i^s) as g(a, i, s_a, l_a, C_i, e_a, f_i^c, f_i^s).

Specifically, in the above definition, s_a, l_a, and C_i denote the inputs of the three social contextual aspects, i.e., the social influence aspect, the upload history aspect and the creator admiration aspect.

In the following of this paper, we use bold capital letters to denote matrices, and small bold letters to denote vectors. For any matrix (e.g., the social graph S), its a-th column vector is denoted by the corresponding small letter with subscript index a (e.g., the a-th column of S is denoted as s_a). We list some mathematical notations in Table 1.

TABLE 1
Mathematical Notations

Notation | Description
U | userset, |U| = M
V | imageset, |V| = N
a, b, c, u | user
i, j, k, v | image
R ∈ R^{M×N} | rating matrix, with r_{ai} denoting whether a likes image i
S ∈ R^{M×M} | social network matrix, with s_{ba} denoting whether a follows b
L ∈ R^{N×M} | upload matrix, with l_{ia} denoting whether a uploads image i
s_a ∈ R^M | the a-th column of S, which denotes the social connections of a
l_a ∈ R^N | the a-th column of L, which denotes the upload history of a
C_i ∈ U | the creator (owner) of image i, C_i = [a : L_{ia} = 1]
e_a | the social embedding of user a from the social embedding matrix E ∈ R^{d×M}
f_i^c | the visual content representation of image i
f_i^s | the visual style representation of image i

4 THE PROPOSED MODEL

In this section, we present our proposed Hierarchical Attentive Social Contextual recommendation (HASC) model for image recommendation.

As shown in Fig. 3, HASC is a hierarchical neural network that models users' preferences for unknown images from two attention levels with social contextual modeling. The top layered attention network depicts the importance of the three contextual aspects (i.e., upload history, social influence and creator admiration) for users' decisions, and it is derived from the bottom layered attention networks that aggregate the complex elements within each aspect. Given a user a and an image i with the three identified social contextual aspects, we use γ_{al} (l = 1, 2, 3) to denote a's attentive degree for aspect l on the top layer (denoted as the aspect importance attention, the orange part in the figure). A larger attentive degree denotes that the current user cares more about this aspect in the image recommendation process. Besides, there are various elements within the upload history context l_a and the social influence context s_a. We use α_{aj} to denote a's preference degree for image j in the upload history context l_a (l_{ja} = 1), where a larger value of α_{aj} indicates that a's current interest is more coherent with the uploaded image j. Similarly, we use β_{ab} to denote the influence strength of b on a in the social neighbor context s_a (s_{ba} = 1), where a larger value of β_{ab} indicates that a is more likely to be influenced by b. Please note that, for each user a and image i, different from the upload history aspect and the social influence aspect, the creator admiration aspect is composed of one element C_i (the creator). Thus, this aspect does not have any sub-layers and is directly sent to the top layer. We use three attention sub-networks to learn these attentive scores in a unified model.
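Before turning to the model details, the following minimal sketch illustrates how the three contextual inputs (s_a, l_a, C_i) of Definition 1 can be gathered for one user-image pair from sparse S and L matrices; the scipy-based storage and the toy sizes are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.sparse import csc_matrix

M, N = 4, 6                                               # M users, N images (Table 1 notation)
S = csc_matrix(([1], ([1], [0])), shape=(M, M))           # s_{1,0} = 1: user 0 follows user 1
L = csc_matrix(([1, 1], ([2, 5], [0, 3])), shape=(N, M))  # user 0 uploaded image 2, user 3 uploaded image 5

def contextual_aspects(a, i, S, L):
    """Return (s_a, l_a, C_i): social links of a, upload history of a, creator of image i."""
    s_a = S.getcol(a).toarray().ravel()    # a-th column of S
    l_a = L.getcol(a).toarray().ravel()    # a-th column of L
    uploaders = L.getrow(i).nonzero()[1]   # users u with l_{iu} = 1
    C_i = int(uploaders[0]) if uploaders.size else None
    return s_a, l_a, C_i

s_a, l_a, C_i = contextual_aspects(a=0, i=5, S=S, L=L)    # here C_i == 3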
Objective Prediction Function. In addition to parameterizing each user a with a base embedding p_a and each item i with a base embedding w_i, as in many latent factor based models [40], [26], we also take the inputs of the three social contextual aspects: s_a, l_a, and C_i. To model the complex contextual aspects, we extend the classical latent factor models and assume each user and each item has two embeddings. Specifically, each user a is associated with a base embedding p_a from the base embedding matrix P to denote her base latent interest, as in the standard latent factor based models, and an auxiliary embedding vector q_a from the auxiliary embedding matrix Q. This auxiliary user embedding vector characterizes each user's preference from the social contextual aspects, which could not be detected by the standard user-image rating behavior. Similarly, each image i is also associated with two embeddings: a base embedding w_i from the item base embedding matrix W to denote the basic image latent vector, and an auxiliary vector x_i from the item auxiliary embedding matrix X to characterize each image from the social contextual inputs. Thus, by combining the attention mechanism with the embeddings, we model each user a's predicted preference for image i as a hierarchical attention:

\hat{r}_{ai} = w_i^T (p_a + γ_{a1} \tilde{x}_a + γ_{a2} \tilde{q}_a + γ_{a3} q_{C_i}),
where \tilde{x}_a = \sum_{j=1}^{N} l_{ja} α_{aj} x_j, \quad \tilde{q}_a = \sum_{b=1}^{M} s_{ba} β_{ab} q_b.    (2)

In the above prediction function, the representations of the three contextual aspects are seamlessly incorporated in a holistic way. Specifically, the first line of Eq.(2) is a top layer attention network that aggregates the three contextual aspects for the user embedding, and the detailed attention subnetworks of the upload history attention and the social influence attention are given in the second line. In fact, the attentive weights (γ_{al}, α_{aj}, and β_{ab}) rely on our carefully designed attention networks that take various information as input. We leave the details of how to model these three attention networks to the following subsections. Next, we show the soundness of the objective prediction function.

Relations to Other Models. By rewriting the predicted preference score in Eq.(2), we have:

\hat{r}_{ai} = \underbrace{p_a^T w_i}_{\text{Basic Latent Factor Model}} + \underbrace{γ_{a1} \sum_{j=1}^{N} α_{aj} l_{ja} x_j^T w_i}_{\text{Item Neighborhood Model}} + \underbrace{γ_{a2} \sum_{b=1}^{M} s_{ba} β_{ab} q_b^T w_i}_{\text{Social Neighborhood Model}} + \underbrace{γ_{a3} q_{C_i}^T w_i}_{\text{Owner Admiration Bias}},    (3)

where the first part is a basic latent factor model, and the following three parts are derived from the three contextual aspects. In the last three terms, x_j^T w_i can be seen as the similarity function between image i and the user's uploaded image j, as in neighborhood-based collaborative filtering, from the upload history aspect [26]; q_b^T w_i represents the social neighbor b's preference for image i from the social influence aspect; and, as each image is uploaded by a creator, the last term models the creator admiration aspect. This is quite natural in the real world, as we often like to follow some specific creators' updates.

Please note that, if we replace all the attention scores with equal weights (i.e., α_{aj} = 1 / \sum_{k=1}^{N} l_{ka}, β_{ab} = 1 / \sum_{c=1}^{M} s_{ca}, and γ_{al} = 1/3), our model turns into an enhanced SVD++ model with rich social contextual information modeling [26], [58]. However, this fixed weight assignment treats each user, each aspect, and the elements in each aspect equally. This simple configuration neglects that each user has different considerations for these three contextual aspects. By using the hierarchical attention networks, we could learn each user's attentive weights from their historical behaviors.
hierarchical attention networks, we could learn each user’s of this paper, for ease of explanation, we also omit the
attentive weights from their historical behaviors. dimension reduction for the visual embeddings (i.e., Wc
and Ws ) whenever they are appeared for the attention
4.1 Hierarchical Attention Network Modeling modeling. Then, the final attentive upload history score αaj
In this subsection, we would follow the bottom-up step to is obtained by normalizing the above attention scores as:
model the hierarchical attention networks in detail. Specif- exp(αaj )
αaj = PN . (5)
ically, we would first introduce the two bottom layered k=1 exp(lka αak )
attention networks: the upload history attention network
After we obtain the attentive upload history score αaj ,
and the social influence attention network, followed by the
the upload history context of user a, denoted as x
ea , is cal-
top layered aspect importance attention network that is
culated as a weighted combination of the learned attentive
based on the bottom layered attention networks.
upload history scores:
Upload History Attention. The goal of the upload his- N
tory attention is to select the images from each user a’s x
ea =
X
lja αaj xj . (6)
upload history that are representative to a’s preferences, and j=1
then aggregate this upload history contextual information to
Social Influence Attention. The social influence atten-
characterize each user. Given each image j that is uploaded
tion module tries to select the influential social neighbors
by a, we model the upload history attentive score αaj as a
from each user a’s social connections, and then summarizes
three-layered attention neural network:
these social neighbors’ influences into a social contextual
αaj = w1 × σ(W1 [pa , qa , xj , wj , ea , vector. If user a follows b, we use βab to denote the social
Wc fjc , Ws fjs , Wc fac , Ws fas ]) (4) influence strength of b to a. Then, the social attentive score
βab could be calculated as:
where Θu = [Wc , Ws , W1 , w1 ] is the parameter set in this βab = w2 σ(W2 [pa , pb , qa , qb , ea , eb , fac , fas ]), (7)
three layered attention network, and σ(x) is a non-linear
activation function. Specifically, as the dimensions of visual where Θs = [W2 , w2 ] are the parameters in the social
content embeddings (i.e., fjc and fac ) and the visual style influence attention network. This social influence attention
embeddings (i.e., fjs and fas ) are much higher than the part also contains three kinds of data embeddings: the user
dimensions of other kinds of embeddings, Wc ∈ RD×4096 interest embeddings of pa , pb , qa , qb , the social embeddings
and Ws ∈ RD×5120 are the parameters of the bottom layer of ea and eb , and the visual embeddings of user a with
that performs dimension reduction of the visual content and content representation fac and style representation fas .
style representations. W1 ∈ R(8D+d)×d1 denotes the matrix Then, the final attentive social influence score βab is
parameter of the second layer in the attention network, as all obtained by normalizing the above attention scores as:
data embedding vectors has D dimensions except the social exp(βab )
embedding ea has d1 dimensions. And w1 ∈ Rd1 is the βab = PM . (8)
c=1 exp(sca βac )
vector parameter of the third layer in the attention network.
In this attention modeling process, we take three different After we obtain the attentive social influence score βab ,
kinds of embeddings as input: the social context of user a, denoted as qea , is calculated as
the a weighted combination as:
• Latent Embedding: the latent embedding includes M
[pa , qa , xj , wj ], where pa and qa are the basic and
X
qea = sba βab qb . (9)
auxiliary embeddings of user a, and xj and wj are b=1
the basic and auxiliary embeddings of item j . Since each image is uploaded by one creator, for each
• Social Embedding: the social embedding part con- image i, the corresponding uploader is represented as Ci .
tains the learned social embedding ea of each user, Correspondingly, the owner appreciation context could be
which models the global and local structure of each simply represented as the the auxiliary embedding qCi from
user in the social network S. the user auxiliary embedding matrix Q.
• Visual Embedding: the visual embedding part in- Aspect Importance Attention Network. The aspect im-
cludes the visual representations of user a and item portance attention network takes the contextual represen-
j . Specifically, each image is characterized by con- tation of each aspect from the bottom layered attention
tent representation fjc and style representation fjs . networks as input, and models the importance of each
Besides, as users show their preferences for images aspect in the user’s decision process. Specifically, for each
from their historical implicit feedbacks, each user a’s pair of user a and image i, we have two contextual rep-
visual content representation and style PN
representa- resentations from the bottom layer of HASC as: upload
r fc
c
tion can also be summarized as: fa = PN rai i , fas =
i=1
history contextual representation x e a , the social influence
PN i=1 ai
s
i=1 rai fi
. contextual representation q e a , and the owner appreciation
P N
r
i=1 ai contextual representation qCi . Then, the aspect importance
By feeding all the sophisticated designed embeddings score γal (l=1, 2, 3) is modeled with an aspect importance
from heterogeneous data sources as the input, the upload attention network as:
history attention network learns to focus on the specific γal = w3 σ(W3 al ), (10)
information. Please note that, we omit the bias terms in
the attention network without confusion. In the following
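A minimal NumPy sketch of one element-level attention sub-network (Eqs.(4)-(6)): the score network is a two-layer MLP followed by a softmax restricted to the uploaded images (the masked reading of Eq.(5)); all weight shapes and random values are illustrative assumptions, not the trained parameters. The social influence attention (Eqs.(7)-(9)) and the aspect importance attention (Eqs.(10)-(11)) share the same structure with their own inputs and parameters.

import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def upload_history_context(inputs, X_upload, W1, w1):
    """inputs: (n, 8D+d) concatenated embeddings for the n images user a uploaded
    X_upload: (n, D) auxiliary embeddings x_j of those images."""
    scores = leaky_relu(inputs @ W1) @ w1          # Eq.(4): unnormalized alpha_aj
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                    # Eq.(5): softmax over the uploads
    return alpha @ X_upload, alpha                 # Eq.(6): x~_a and the weights

# Toy shapes: D = 15 latent dims, d = 128 social dims, d1 = 20 attention dims, n = 6 uploads.
D, d, d1, n = 15, 128, 20, 6
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(8 * D + d, d1))
w1 = rng.normal(scale=0.1, size=d1)
x_tilde, alpha = upload_history_context(
    rng.normal(size=(n, 8 * D + d)), rng.normal(size=(n, D)), W1, w1)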
4.2 Model Learning

As we focus on users' implicit feedbacks, similar to the widely used ranking based loss function in ranking based latent factor models [40], we design a ranking based loss function as:

\min_{\Theta} L = \sum_{a=1}^{M} \sum_{(i,j) \in D_a} s(\hat{r}_{aj} - \hat{r}_{ai}) + λ \|Θ_1\|^2,    (12)

where s(x) is a sigmoid function that transforms the input into the range (0, 1). Θ = [Θ_1, Θ_2], with Θ_1 = [P, Q, W, X] denoting the embedding matrices and Θ_2 = [Θ_u, Θ_s, Θ_a] denoting the parameters of the attention networks. λ is a regularization parameter that regularizes the user and image embeddings. D_a = {(i, j) | i ∈ R_a ∧ j ∈ V − R_a} is the training data for a, with R_a the set of images for which a shows positive feedback.

All the parameters in the above loss function are differentiable. In practice, we implement HASC with TensorFlow and train the model parameters with mini-batch Adam. The detailed training algorithm is shown in Algorithm 1. In practice, we can only observe users' positive feedbacks, with a huge number of missing unobserved values; similar to many implicit feedback works, for each positive feedback we randomly sample 5 missing unobserved feedbacks as pseudo negative feedbacks at each iteration of the training process [50], [49], [5]. As the pseudo negative samples change at each iteration, each missing value gives only a very weak negative signal.

Algorithm 1 The learning algorithm of HASC
Input: Rating matrix R, social matrix S, upload matrix L; batch size m; max epoch T;
Output: Latent embedding matrices Θ_1 = [P, Q, W, X] and parameters in the attention networks Θ_2;
1: Initialize Θ with a Gaussian distribution with a mean of 0 and a standard deviation of 0.1;
2: for epoch ← 1 to T do
3:   Get training data D with 5 randomly selected negative feedbacks per positive record, <a, i, j> (a ∈ U, i ∈ R_a, j ∈ V − R_a);
4:   for mini_epoch ← 1 to |D|/m do
5:     Get mini batch: randomly select m pairs <a^k, i^k, j^k>_{k=1}^{m} from the training data;
6:     for each pair <a^k, i^k, j^k> in the mini batch do
7:       Compute the predicted rating of the positive item r̂_{ai} (Eq.(2));
8:       Compute the predicted rating of the negative item r̂_{aj} (Eq.(2));
9:       Compute the loss L^k (Eq.(12));
10:    end for
11:    Update Θ with the loss \frac{1}{m} \sum_{k=1}^{m} L^k;
12:  end for
13: end for
14: Return Θ_1 = [P, Q, W, X] and the parameters in the attention networks Θ_2.

5 EXPERIMENTS

In this section, we show the effectiveness of our proposed HASC model. Specifically, we answer the following questions. Q1: How does our proposed model perform compared to the baselines (Sec. 5.2)? Q2: How does the model perform under different sparsity (Sec. 5.3)? Q3: How do the proposed social contextual aspects and the hierarchical attention perform (Sec. 5.4)?

5.1 Experimental Settings

Dataset. To the best of our knowledge, there is no publicly available dataset that contains the heterogeneous data sources of a social image based network as described in Fig. 1. To show the effectiveness of our proposed model, we crawl a large dataset from one of the largest social image sharing platforms, Flickr, which is extended from the widely used NUS-WIDE dataset [7], [45]. NUS-WIDE contains nearly 270,000 images with 81 human defined categories from Flickr. Based on this initial data, we obtain the uploader information according to the image IDs provided in the NUS-WIDE dataset from the public APIs of Flickr. We treat all the uploaders as the initial userset, and the associated images as the imageset. We then crawl the social network of the userset, and the implicit feedbacks of the userset to the imageset.

After data collection, in the data preprocessing process, we filter out users that have less than 2 rating records and 2 social links. We also filter out images that have less than 2 records. We call the filtered dataset FL. As shown in Table 2, this dataset is very sparse, with about 0.15% density. Besides, we further filter the FL dataset to ensure each user and each image have at least 10 rating records. This leads to a smaller but denser dataset, FS. Table 2 shows the statistics of the two datasets after pruning. Please note that the number of images is much larger than that of the users. This is consistent with the observation that the number of images usually far exceeds that of users in social image platforms [1], as each user could be a creator who uploads multiple images. In the data splitting process, we follow the leave-one-out procedure used in many research works [5], [20]. Specifically, for each user, we select the last rating record as the test data, and the remaining data are used as the training data. To tune model parameters, we randomly select 5% of the training data to constitute the validation dataset.

TABLE 2
The statistics of the two datasets.

Dataset | Users | Images | Ratings | Social Links | Rating Density
FS | 4,418 | 31,460 | 761,812 | 184,991 | 0.55%
FL | 8,358 | 105,648 | 1,323,963 | 378,713 | 0.15%
Fig. 4. Overall performance of different models on the two datasets: HR@K and NDCG@K (K = 5 to 10) on FS and FL for BPR, VBPR, ACF, SR, ContextMF, VPOI, and HASC. (Better viewed in color.)
Evaluation Metrics. Since we focus on recommending images to users, we use two widely adopted ranking metrics for top-K recommendation evaluation: the Hit Ratio (HR) and the Normalized Discounted Cumulative Gain (NDCG) [18], [5]. HR measures the percentage of images that are liked by users in the top-K list, and NDCG gives a higher score to the hit images that are ranked higher in the ranking list. As the image set is huge, it is inefficient to take all images as candidates to generate recommendations. For each user, we randomly select 100 unrated images as candidates, and then mix them with the records in the validation and test data to select the top-K results. This evaluation process is repeated 10 times and we report the average results [18], [5]. For both metrics, the larger the value, the better the ranking performance.
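For concreteness, the following is a minimal sketch of HR@K and NDCG@K over one user's ranked candidate list; this is the common single-test-item formulation of these metrics, and the toy scores and item names are hypothetical.

import numpy as np

def hr_at_k(ranked_items, test_item, k):
    # 1 if the held-out test item appears in the top-K list, else 0
    return int(test_item in ranked_items[:k])

def ndcg_at_k(ranked_items, test_item, k):
    # With a single relevant item, DCG reduces to 1/log2(rank+2) and IDCG = 1.
    for rank, item in enumerate(ranked_items[:k]):
        if item == test_item:
            return 1.0 / np.log2(rank + 2)
    return 0.0

# Usage: rank the 100 sampled negatives plus the held-out test item by predicted score.
scores = {"img_7": 0.91, "img_3": 0.85, "img_9": 0.20}   # hypothetical predictions
ranked = sorted(scores, key=scores.get, reverse=True)
print(hr_at_k(ranked, "img_3", 5), ndcg_at_k(ranked, "img_3", 5))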
Baselines. We compare our proposed HASC model with the following baselines:

• BPR: a classical ranking based latent factor model for recommendation with competitive performance. This method has been well recognized as a strong baseline for recommendation [40].
• SR: a social based recommendation model that encodes the social influence among users with social regularization terms in classical latent factor based models [33].
• ContextMF: this method models various social contextual factors, including item content topic, user personal interest, and inter-personal influence, in a unified social contextual recommendation framework [24].
• VBPR: it extends BPR by modeling both the visual and latent dimensions of users' preferences in a unified framework, where the visual content dimension is derived from a pre-trained VGG network.
• ACF: it models item level and component level attention for image recommendation with two attention networks. For fair comparison, we enrich this baseline by leveraging the upload history as users' auxiliary feedback in this model [5].
• VPOI: a visual based POI recommendation algorithm. It relies on collective matrix factorization to consider the images associated with each POI and the images uploaded by each user. To adapt POI recommendation to image recommendation, we treat each image as a POI and the uploaded images of each user as her associated images [49].

Parameter setting. In the social embedding process with DeepWalk [37], we set the parameters as: the window size w = 10 and walks per vertex ρ = 80. The social embedding size d is chosen from [32, 64, 128]. We find that when d = 128, the social embedding reaches the best performance; hence, we set d = 128 in DeepWalk. There are two important parameters in our proposed model: the dimension D of the user and image embeddings, and the regularization parameter λ in the objective function (Eq.(12)). We choose D in [10, 15, 20, 30] and λ in [0.001, 0.01, 0.1], and perform a grid search to find the best parameters. The best setting is D = 15 and λ = 0.01. We find the dimension of the attention networks does not impact the results much; thus, we empirically set the dimensions of the parameters in the attention networks (i.e., the parameters in Θ_2) as 20. The activation function σ(x) is set as the Leaky ReLU. To initialize the model, we randomly set the weights in the attention networks with a Gaussian distribution of mean 0 and standard deviation 0.1. Since the objective function of HASC is non-convex, we initialize P and W from the basic BPR model, and Q and X with the same Gaussian distribution as the parameters of the attention networks, to speed up convergence. We use mini-batch Adam to optimize the model, where the batch size is set as 512 and the initial learning rate is set as 0.0005. There are several parameters in the baselines; for a fair comparison, all the parameters in the baselines are also tuned to achieve their best performance. For all models, we stop model training when both HR@5 and NDCG@5 on the validation dataset begin to decrease.
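As a sketch of how DeepWalk-style social embeddings such as e_a can be produced under the settings above (window 10, 80 walks per vertex, d = 128): the random-walk generator, the walk length, and the use of gensim (≥ 4) Word2Vec over a networkx toy graph are illustrative assumptions, not the authors' exact pipeline.

import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=80, walk_length=40):
    """Truncated random walks over the social graph; node IDs are treated as 'words'."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(walks_per_node):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Toy social graph; in the paper the graph would be built from the matrix S.
G = nx.karate_club_graph()
model = Word2Vec(sentences=random_walks(G), vector_size=128, window=10,
                 sg=1, min_count=0, workers=4)
e_a = model.wv["0"]   # 128-dimensional social embedding of user 0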
5.2 Overall Performance

Fig. 4 shows the overall performance of all models on HR@K and NDCG@K on the two datasets with varying sizes of K, where the top two subfigures depict the results on the FS dataset and the bottom two subfigures depict the results on the FL dataset. As shown in this figure, our proposed HASC model always performs the best. With the increase of the top-K list size, the performance of all models increases, and the performance trend is consistent over different top-K values and different metrics. We find that considering either the social network or the visual image information could alleviate the data sparsity problem and improve recommendation performance. E.g., VBPR improves over BPR by about 3% by incorporating the visual information in the modeling process. ACF further improves over VBPR by assigning attentive weights to the different images the user rated and uploaded in the past. SR also has better performance as it leverages the social network information, and ContextMF further improves the performance with content modeling. On average, our proposed model shows about 20% improvement over the BPR baseline, and more than 10% improvement over the best baselines on both datasets with regard to the NDCG@5 metric. Last but not least, by comparing the results on FS and FL, we observe that for each method, the results on FL always outperform those on FS. We guess a possible reason is that, though FS is denser than FL, the larger FL has nearly twice as many records as FS for training. As the overall trend is similar on the two metrics with different values of K, in the following subsections, due to the page limit, we only show the top-5 results.

5.3 Performance under Different Data Sparsity

A key characteristic of our proposed model is that it alleviates the data sparsity issue with various social contextual aspects modeling. In this subsection, we investigate the performance of various models under different data sparsity. We mainly focus on the FL dataset as it is more challenging, with sparser user rating records compared to the denser FS dataset. Specifically, we bin users into different groups based on the number of observed feedbacks in the training data, and then show the performance for the different groups. Fig. 5 shows the results, where the left part summarizes the user group distribution of the training data and the right part depicts the performance under different data sparsity. As shown in the left part, more than 5% of users have less than 4 ratings, and 20% of users have less than 16 ratings, with more than 100 thousand images in the FL dataset. When the rating scale is very sparse, the BPR baseline cannot work well, as it only models the sparse user-image implicit feedbacks. Under this situation, the improvement over BPR is significant for all models, as these models utilize different auxiliary data for recommendation. E.g., when users have less than 4 ratings, our proposed HASC model improves over BPR by more than 35%. As the user rating scale increases, the performance of all models increases quickly with more training rating records, and HASC still consistently outperforms the baselines.

TABLE 3
The improvement of using different attention mechanisms compared to BPR.

Bottom Layer Attention | Top Layer Attention | FS HR | FS NDCG | FL HR | FL NDCG
AVG | AVG | 6.44% | 10.28% | 5.54% | 9.02%
MAX | MAX | 5.82% | 9.55% | 4.98% | 8.10%
AVG | ATT | 7.33% | 11.15% | 5.95% | 9.93%
MAX | ATT | 6.84% | 10.96% | 5.72% | 9.55%
ATT | AVG | 12.75% | 19.23% | 8.30% | 13.28%
ATT | MAX | 12.20% | 18.56% | 8.02% | 12.85%
ATT | ATT | 14.57% | 22.55% | 10.67% | 16.70%

TABLE 4
The improvement of modeling different contextual aspects with our proposed model compared to BPR. (U: upload history, S: social influence, C: creator admiration)

Aspects | FS HR | FS NDCG | FL HR | FL NDCG
U | 8.70% | 16.52% | 6.44% | 11.03%
S | 9.63% | 16.78% | 5.29% | 9.65%
C | 8.57% | 14.53% | 4.37% | 7.93%
U+S+C | 14.57% | 22.55% | 10.67% | 16.70%

5.4 Attention Analysis

In this part, we conduct experiments to give a more detailed analysis of the proposed attention networks. We evaluate the soundness of the designed attention structure and the superiority of combining the various data embeddings for attention modeling.

In the experiments, we use the Leaky ReLU as the activation function σ(x) for attention modeling, and then attentively combine the elements of each set with soft attention (ATT). Alternatively, instead of attentively combining all the elements, a direct solution is to use hard attention with a MAX operation that selects the element with the largest attentive score at each layer of the hierarchical attention network. E.g., for the upload history aspect, MAX computes the upload history context in Eq.(6) as: \tilde{x}_a = x_j, where l_{ja} = 1 ∧ (∀ l_{ka} = 1, α_{aj} ≥ α_{ak}). Particularly, if we simply set the attentive scores by average pooling (AVG) (i.e., α_{aj} = 1/|L_a|, β_{ab} = 1/|S_a|, γ_{al} = 1/3), our model degenerates to an enhanced SVD++ with social contextual modeling but without any attentive modeling. If we do not model any social contextual aspects, our model degenerates to the BPR model [40]. Table 3 shows the results of the different attention mechanisms. As shown in this table, the best results are achieved by using our proposed attention mechanism, followed by AVG and MAX. We guess a possible reason is that each user's interests are diversified, and it is challenging to infer each user's interests from the limited training data; if we simply use hard attention with the maximum value or adopt average aggregation, much valuable contextual information is neglected in this process.
Fig. 5. Performance under different data sparsity on the FL dataset: the left part shows the percentage of users in each rating group ([0,4), [4,16), [16,64), [64,256), [256,∞) training ratings per user), and the right part shows NDCG@5 of BPR, VBPR, ACF, SR, ContextMF, VPOI, and HASC for each group.
observe that ATT operating at the bottom layer achieves much better performance than its counterpart operating at the top layer (e.g., the comparison between the fourth row and the sixth row). Since each aspect at the bottom layer usually contains many more elements than the top layer, attentively summarizing each contextual aspect at the bottom layer provides valuable information to the top layer. In contrast, if we use AVG or MAX at the bottom layer, the results are not satisfactory even when we use ATT at the second layer, since the input of the second layer lacks much important information.

After showing the soundness of our proposed attention structure, Table 4 presents the performance of using different contextual aspects with our proposed hierarchical attention. As shown in this table, each aspect improves the performance. By combining all social contextual aspects with hierarchical attention, the model reaches the best performance.

Besides, in the attention modeling process, we also learn the attentive weights by modeling different kinds of input embeddings from the heterogeneous data sources. Each attention layer takes the following kinds of inputs: the latent interest representations of the base embeddings (i.e., $p_a$ and $w_i$) and the auxiliary embeddings (i.e., $q_a$ and $x_i$), the social embeddings (i.e., $e_a$), and the visual embeddings with content representations (i.e., $f_i^c$ of image $i$ and $f_a^c$ of user $a$) and style representations (i.e., $f_i^s$ of image $i$ and $f_a^s$ of user $a$). Table 5 shows the performance of HASC with different kinds of input embeddings. From this table, we have several observations. First, as the auxiliary latent embedding representation models each user and each item from the rich social contextual information, feeding the auxiliary embeddings improves the performance compared to feeding the base embeddings alone for attention modeling. Second, the improvement from the social embeddings is not very significant. We guess a possible reason is that the social influence aspect already considers the social neighborhood information for users' interest modeling; as the social embeddings represent the overall social network with both local and global structure, the additional global network structure modeling brings only limited improvement. Third, we observe that the improvement from the visual embeddings is very significant. Both the content and the style information enhance the recommendation performance, and combining content and style embeddings improves it further. This observation empirically shows the complementary relationship of content and style in visual images. Last but not least, by feeding all three kinds of data embeddings into the attention network, the proposed HASC achieves the best performance.

TABLE 5
Performance of different kinds of inputs for attention modeling. "Base": base embedding; "Aux": auxiliary embedding; "Soc": social embedding; "Vis C", "Vis S", "Vis CS": visual content feature, visual style feature, and both visual features.

                         F_S             F_L
Input Embedding          HR      NDCG    HR      NDCG
Base                     0.358   0.257   0.439   0.319
Base+Aux                 0.366   0.264   0.445   0.323
Base+Aux+Soc             0.367   0.270   0.450   0.331
Base+Aux+Vis C           0.388   0.278   0.453   0.335
Base+Aux+Vis S           0.383   0.275   0.451   0.332
Base+Aux+Vis CS          0.393   0.282   0.464   0.342
Base+Aux+Soc+Vis CS      0.400   0.289   0.475   0.347
In the previous experiments, we used DeepWalk as the social network embedding model to obtain the social network embedding vector of each user. We now examine the effectiveness of adopting different network embedding techniques. We choose two state-of-the-art network embedding models, LINE [44] and GCN [25], and compare the performance. The results are shown in Table 6. As can be seen from this table, when the item visual embeddings are not incorporated, using the more advanced graph embedding techniques (e.g., GCN) partially improves the recommendation performance, as these advanced models better capture the social network structure. When all the input embeddings are incorporated, these advanced graph embedding models show performance similar to the DeepWalk based network embedding model. We guess the reason is that, as shown in Table 5, the improvement from the social embedding is not as significant as that from the visual based input for attention modeling when all the input embeddings are considered.

TABLE 6
Performance of different kinds of social embedding techniques for the attention modeling.

                              F_S             F_L
Input Embedding               HR      NDCG    HR      NDCG
Base+Aux+DeepWalk             0.367   0.270   0.450   0.331
Base+Aux+LINE                 0.369   0.273   0.452   0.334
Base+Aux+GCN                  0.371   0.276   0.459   0.340
Base+Aux+DeepWalk+Vis CS      0.400   0.289   0.475   0.347
Base+Aux+LINE+Vis CS          0.400   0.289   0.474   0.345
Base+Aux+GCN+Vis CS           0.401   0.290   0.475   0.348
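To make the compared social embeddings concrete, the following is a minimal, illustrative sketch of the DeepWalk-style preprocessing [37] that produces a per-user social embedding: truncated random walks are sampled over the follower graph, and the resulting node sequences would then be fed to a standard skip-gram trainer. The toy graph, walk length, and number of walks are assumptions for illustration only, not the settings used in our experiments.

```python
import random
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency list from (user, follower) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def random_walks(adj, num_walks=10, walk_length=40, seed=0):
    """Sample truncated random walks starting from every node (DeepWalk-style)."""
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy follower graph; in practice this would be the platform's social network.
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u1", "u4"), ("u4", "u5")]
walks = random_walks(build_adjacency(edges))
# Each walk is a sequence of user ids; training a skip-gram model on these
# sequences yields one vector e_a per user, which is what the "DeepWalk"
# rows of Table 6 feed into the attention network.
print(len(walks), walks[0][:5])
```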
Attention Weights Visualization. Besides giving the overall results of the different attention modeling settings, we also visualize the learned attention weights of users from the F_L dataset. Firstly, for each user, we group her into three categories according to the aspect that has the [...]
[Figure 7 panels: each row shows one example user's learned aspect-level attention weights (Upload, Social, Owner), the NDCG@5 obtained by the single-aspect HASC variants (U, S, C) and by baselines such as BPR and SVD++ compared with the full HASC, and a short note describing whether the test image resembles the user's training images, what fraction of the owner's images the user has liked, and whether any of the user's followers liked the image.]
Fig. 7. Case study of several typical users. Each row represents a user; the first and second columns show the user's training and test images. The third column reports the Top-5 recommendation results in terms of NDCG@5; within it, the left three models are simplified versions of our proposed HASC model that each leverage only one aspect, and the best-performing model is shown in bold italic letters.
[13] F. Gelli, X. He, T. Chen, and T.-S. Chua. How personality affects our likes: Towards a better understanding of actionable images. In MM, pages 1828–1837. ACM, 2017.
[14] F. Gelli, T. Uricchio, X. He, A. Del Bimbo, and T.-S. Chua. Beyond the product: Discovering image posts for brands in social media. In MM. ACM, 2018.
[15] Y. Gong and Q. Zhang. Hashtag recommendation using attention-based convolutional neural network. In IJCAI, pages 2782–2788, 2016.
[16] G. Guo, J. Zhang, and N. Yorke-Smith. A novel recommendation model regularized with user trust and item ratings. TKDE, 28(7):1607–1620, 2016.
[17] R. He, C. Fang, Z. Wang, and J. McAuley. Vista: A visually, socially, and temporally-aware model for artistic recommendation. In RecSys, pages 309–316. ACM, 2016.
[18] R. He and J. McAuley. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI, pages 144–150, 2016.
[19] X. He, Z. He, J. Song, Z. Liu, Y.-G. Jiang, and T.-S. Chua. NAIS: Neural attentive item similarity model for recommendation. TKDE, 2018.
[20] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In WWW, pages 173–182, 2017.
[21] B. Hu, C. Shi, W. X. Zhao, and P. S. Yu. Leveraging meta-path based context for top-n recommendation with a neural co-attention model. In SIGKDD, pages 1531–1540. ACM, 2018.
[22] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. PAMI, 20(11):1254–1259, 1998.
[23] M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In RecSys, pages 135–142. ACM, 2010.
[24] M. Jiang, P. Cui, F. Wang, W. Zhu, and S. Yang. Scalable recommendation with social contextual information. TKDE, 26(11):2789–2802, 2014.
[25] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[26] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In KDD, pages 426–434. ACM, 2008.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[28] C. Lei, D. Liu, W. Li, Z.-J. Zha, and H. Li. Comparative deep learning of hybrid representations for image recommendations. In CVPR, pages 2545–2553, 2016.
[29] J. Li, M.-T. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. arXiv:1506.01057, 2015.
[30] D. C. Liu, S. Rogers, R. Shiau, D. Kislyuk, K. C. Ma, Z. Zhong, J. Liu, and Y. Jing. Related pins at Pinterest: The evolution of a real-world recommender system. In WWW, pages 583–592, 2017.
[31] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang. STAMP: Short-term attention/memory priority model for session-based recommendation. In SIGKDD, pages 1831–1839. ACM, 2018.
[32] P. Loyola, C. Liu, and Y. Hirate. Modeling user session and intent with an attention-based encoder-decoder architecture. In RecSys, pages 147–151. ACM, 2017.
[33] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In WSDM, pages 287–296. ACM, 2011.
[34] A. Mnih and R. R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, pages 1257–1264, 2008.
[35] W. Niu, J. Caverlee, and H. Lu. Neural personalized ranking for image recommendation. In WSDM, pages 423–431. ACM, 2018.
[36] W. Pan and Z. Ming. Collaborative recommendation with multi-class preference context. IEEE Intelligent Systems, 32(2):45–51, 2017.
[37] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, pages 701–710. ACM, 2014.
[38] X. Qian, H. Feng, G. Zhao, and T. Mei. Personalized recommendation combining user interest and social circle. TKDE, 26(7):1763–1777, 2014.
[39] S. Rendle. Factorization machines with libFM. TIST, 3(3):57, 2012.
[40] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452–461. AUAI Press, 2009.
[41] S. Seo, J. Huang, H. Yang, and Y. Liu. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In RecSys, pages 297–305. ACM, 2017.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[43] P. Sun, L. Wu, and M. Wang. Attentive recurrent social recommendation. In SIGIR, pages 185–194. ACM, 2018.
[44] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, pages 1067–1077, 2015.
[45] J. Tang, X. Shu, G.-J. Qi, Z. Li, M. Wang, S. Yan, and R. Jain. Tri-clustered tensor completion for social-aware image tag refinement. PAMI, 39(8):1662–1674, 2017.
[46] Y. Tay, A. T. Luu, and S. C. Hui. Multi-pointer co-attention networks for recommendation. In SIGKDD, pages 2309–2318. ACM, 2018.
[47] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In ICLR, 2018.
[48] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In KDD, pages 1225–1234. ACM, 2016.
[49] S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, and H. Liu. What your images reveal: Exploiting visual contents for point-of-interest recommendation. In WWW, pages 391–400, 2017.
[50] L. Wu, Y. Ge, Q. Liu, E. Chen, R. Hong, J. Du, and M. Wang. Modeling the evolution of users' preferences and social links in social networking services. TKDE, 29(6):1240–1253, 2017.
[51] L. Wu, P. Sun, R. Hong, Y. Ge, and M. Wang. Collaborative neural social recommendation. TSMC: Systems, 2019.
[52] J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T.-S. Chua. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In IJCAI, pages 3119–3125, 2017.
[53] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy. Hierarchical attention networks for document classification. In HLT-NAACL, pages 1480–1489, 2016.
[55] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma. Collaborative knowledge base embedding for recommender systems. In KDD, pages 353–362. ACM, 2016.
[56] Q. Zhang, J. Wang, H. Huang, X. Huang, and Y. Gong. Hashtag recommendation for multimodal microblog using co-attention network. In IJCAI, pages 3420–3426, 2017.
[57] S. Zhang, L. Yao, A. Sun, and Y. Tay. Deep learning based recommender system: A survey and new perspectives. CSUR, 52(1):5, 2019.
[58] S. Zhang, L. Yao, and X. Xu. AutoSVD++: An efficient hybrid collaborative filtering model via contractive auto-encoders. In SIGIR, pages 957–960. ACM, 2017.
[59] Z. Zhao, H. Lu, D. Cai, X. He, and Y. Zhuang. User preference learning for online social recommendation. TKDE, 28(9):2522–2534, 2016.

Le Wu is currently an assistant professor at the Hefei University of Technology (HFUT), China. She received the Ph.D. degree from the University of Science and Technology of China (USTC). Her general research interests include data mining, recommender systems, and social network analysis. She has published more than 30 papers in refereed journals and conferences. Dr. Le Wu is the recipient of the Best of SDM 2015 Award and the Distinguished Dissertation Award from the China Association for Artificial Intelligence (CAAI) in 2017.

Lei Chen is currently working toward the M.S. degree at the Hefei University of Technology, China. He received the B.S. degree from Anhui University in 2016. His research interests include multimedia analysis and data mining.

Richang Hong (M'12) is currently a professor at HFUT. He received the Ph.D. degree from USTC in 2008. He has co-authored over 60 publications in the areas of his research interests, which include multimedia question answering, video content analysis, and pattern recognition. He is a member of the Association for Computing Machinery. He was a recipient of the Best Paper Award at ACM Multimedia 2010.

Yanjie Fu received his Ph.D. degree from Rutgers University in 2016, the B.E. degree from the University of Science and Technology of China in 2008, and the M.E. degree from the Chinese Academy of Sciences in 2011. He is currently an Assistant Professor at the Missouri University of Science and Technology. His general interests are data mining and big data analytics. He has published prolifically in refereed journals and conference proceedings, such as IEEE TKDE, ACM TKDD, IEEE TMC, and ACM SIGKDD.

Xing Xie (SM'09) is currently a senior researcher at Microsoft Research Asia and a guest Ph.D. advisor at USTC. His research interests include spatial data mining, location-based services, social networks, and ubiquitous computing. In recent years, he has been involved in the program or organizing committees of over 70 conferences and workshops. In particular, he initiated the LBSN workshop series and served as program co-chair of ACM UbiComp 2011. He is a senior member of the ACM and the IEEE, and a distinguished member of the China Computer Federation (CCF).

Meng Wang is a professor at the Hefei University of Technology, China. He received his B.E. degree and Ph.D. degree in the Special Class for the Gifted Young and the Department of Electronic Engineering and Information Science from the University of Science and Technology of China (USTC), Hefei, China, in 2003 and 2008, respectively. His current research interests include multimedia content analysis, computer vision, and pattern recognition. He has authored more than 200 book chapters, journal papers, and conference papers in these areas. He is the recipient of the ACM SIGMM Rising Star Award 2014. He is an associate editor of IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT), IEEE Transactions on Multimedia (IEEE TMM), and IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS).
The model employs a hierarchical attention network that differentially weights the three social contextual aspects (upload history, social influence, and owner admiration) according to their relevance to each specific user. The aspect importance attention at the top layer assigns a different weight to each aspect, allowing the system to prioritize them by their significance for that user's image recommendation process and thus providing a more customized recommendation experience.
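As an illustration of this top-layer aspect attention, the sketch below scores each of the three aspect vectors against the user's latent embedding with a small scoring network and normalizes the scores with a softmax. The dimensionality, the exact scoring form, and the random parameters are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                      # latent dimensionality (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aspect_attention(user_vec, aspect_vecs, W, b, h):
    """Top-layer attention: weight the three aspect vectors per user.

    user_vec:    (D,)   user latent embedding
    aspect_vecs: (3, D) upload-history, social-influence, owner-admiration vectors
    """
    scores = []
    for a in aspect_vecs:
        hidden = np.tanh(W @ np.concatenate([user_vec, a]) + b)   # scoring network
        scores.append(h @ hidden)                                 # scalar score
    weights = softmax(np.array(scores))                           # aspect importances
    return weights, weights @ aspect_vecs                         # attended summary

# Toy inputs standing in for learned embeddings.
user_vec = rng.normal(size=D)
aspect_vecs = rng.normal(size=(3, D))
W, b, h = rng.normal(size=(D, 2 * D)), np.zeros(D), rng.normal(size=D)

weights, aux_user = aspect_attention(user_vec, aspect_vecs, W, b, h)
print("aspect weights (upload, social, owner):", np.round(weights, 3))
```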
The hierarchical attention network makes image recommendations more adaptive by modeling the contextual aspects through attention mechanisms that dynamically adjust their weights based on each user's interactions and preferences. Unlike traditional collaborative filtering, which relies heavily on fixed user-item interactions, the attention layers can shift focus according to contextual relevance, such as recent uploads or social influence, yielding recommendations that better reflect current user interests and behaviors.

The hierarchical attention network also distinguishes itself from basic latent factor models such as SVD++ by incorporating attentive weights that are specific to each user's preferences and to the contextual importance of the different aspects. Whereas SVD++ treats all such auxiliary evidence equally, the hierarchical model adapts the weights dynamically by considering the user's upload history, social influence, and owner admiration, thereby personalizing recommendations based on more nuanced behavioral patterns.
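The contrast lies in how auxiliary evidence is aggregated. Below is a minimal sketch, assuming a set of implicit-feedback item vectors for one user: an SVD++-style model folds them in with a fixed 1/sqrt(|N(u)|) weight, whereas an attention-based model assigns a learned per-element weight. The scoring function here is a simple stand-in, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
item_vecs = rng.normal(size=(5, D))      # items the user interacted with (toy)
user_vec = rng.normal(size=D)

# SVD++-style aggregation: every element receives the same fixed weight.
svdpp_summary = item_vecs.sum(axis=0) / np.sqrt(len(item_vecs))

# Attention-style aggregation: weights depend on the user and each element.
v = rng.normal(size=D)                              # assumed scoring vector
scores = item_vecs @ (user_vec * v)                 # simple bilinear-style score
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended_summary = weights @ item_vecs

print("fixed weight:", round(1 / np.sqrt(len(item_vecs)), 3))
print("learned-style weights:", np.round(weights, 3))
```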
The hierarchical structure is crucial for managing the complex relationships in social image platforms, as it organizes the various contextual elements into a cohesive framework. The model constructs the user interest representation by first creating auxiliary aspect representations and then combining them into an auxiliary user interest vector. This allows the model to process different levels of information and prioritize them according to relevance, effectively capturing the nuances of social interactions and individual preferences in its recommendations.

In this way, the hierarchical attention model enhances traditional latent factor models by incorporating the three key social contextual aspects, i.e., upload history, social influence, and owner admiration, which helps capture complex user preferences. The model weighs these aspects differently according to their importance to each user, improving recommendation quality by leveraging the rich contextual information.

The upload history attention selects and aggregates the images from each user's upload history that best represent her preferences. It uses a three-layer neural network to calculate the upload history attentive score, so that the selected images reflect the user's current interests; this personalized attention helps tailor recommendations based on past behavior.
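A minimal sketch of this element-level attention, assuming each uploaded image is represented by an embedding vector and that the attentive score comes from a three-layer feed-forward network over the user embedding and the image embedding; the layer sizes and the use of ReLU are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 16, 32                      # embedding size and hidden size (assumed)

def relu(x):
    return np.maximum(x, 0.0)

def upload_history_attention(user_vec, upload_vecs, params):
    """Aggregate a user's uploaded-image embeddings with learned attention."""
    W1, b1, W2, b2, w3 = params
    scores = []
    for img in upload_vecs:
        x = np.concatenate([user_vec, img])
        h1 = relu(W1 @ x + b1)          # layer 1
        h2 = relu(W2 @ h1 + b2)         # layer 2
        scores.append(w3 @ h2)          # layer 3: scalar attentive score
    scores = np.array(scores)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights, weights @ upload_vecs   # upload-history aspect vector

params = (rng.normal(size=(H, 2 * D)), np.zeros(H),
          rng.normal(size=(H, H)), np.zeros(H),
          rng.normal(size=H))
user_vec = rng.normal(size=D)
upload_vecs = rng.normal(size=(6, D))       # six uploaded images (toy)

w, aspect_vec = upload_history_attention(user_vec, upload_vecs, params)
print("per-image attention weights:", np.round(w, 3))
```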
User-specific attentive weights significantly enhance the model's performance by customizing the importance of the different social contextual aspects for each individual user. These weights ensure that the model captures each user's unique preference profile and her interactions with the social network and with content owners, leading to more accurate and personalized image recommendations. This tailored approach addresses the inherent variability in user behavior and content interest, making the system adaptable and responsive to individual user dynamics.

The proposed model also integrates visual content and style embeddings to enhance image recommendation performance. The visual embeddings are fed into the upload history attention network, where they help characterize a user's interests by recognizing patterns in the visual appearance of previously uploaded images. These embeddings help the model understand users' aesthetic preferences, contributing to recommendations that align with their visual tastes.
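For intuition on the two kinds of visual features, the sketch below assumes an image has already been passed through a pretrained CNN and that one convolutional activation map of shape (C, H, W) is available: the content descriptor is a pooled version of the activations, while a style descriptor can be built from the Gram matrix of channel correlations. This is one common construction and only an assumed stand-in for the paper's actual feature extractors.

```python
import numpy as np

rng = np.random.default_rng(3)

def content_and_style(feature_map):
    """Derive content and style descriptors from one CNN activation map.

    feature_map: (C, H, W) activations from a pretrained convolutional layer.
    """
    C, H, W = feature_map.shape
    flat = feature_map.reshape(C, H * W)

    # Content: spatially pooled activations (roughly, what appears in the image).
    content = flat.mean(axis=1)                       # (C,)

    # Style: Gram matrix of channel correlations (roughly, how the image looks).
    gram = (flat @ flat.T) / (C * H * W)              # (C, C)
    style = gram[np.triu_indices(C)]                  # upper triangle as a vector
    return content, style

feature_map = rng.normal(size=(8, 7, 7))              # toy activation map
f_content, f_style = content_and_style(feature_map)
print(f_content.shape, f_style.shape)                 # (8,) and (36,)
```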
The three social contextual aspects play distinct roles in the image recommendation model: 1) upload history summarizes the user's interests based on her previously uploaded images; 2) social influence captures the influence from the user's social network; 3) owner admiration reflects the user's attitude toward the uploader of the recommended image. Each aspect provides a different perspective on user preferences, and the three are combined into an auxiliary user latent embedding to improve recommendation accuracy.
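One plausible reading of "combined into an auxiliary user latent embedding", shown purely as an assumed sketch rather than the paper's exact predictor: the attention-weighted auxiliary vector is added to the user's base latent vector before taking an inner product with the item's latent vector.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 16

p_a = rng.normal(size=D)                  # base user latent vector
w_i = rng.normal(size=D)                  # base item latent vector
aspect_vecs = rng.normal(size=(3, D))     # upload / social / owner aspect vectors
alpha = np.array([0.5, 0.3, 0.2])         # aspect attention weights (toy values)

aux_a = alpha @ aspect_vecs               # auxiliary user latent embedding
score = (p_a + aux_a) @ w_i               # predicted preference of user a for image i
print(round(float(score), 3))
```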
Replacing the attention scores with equal weights could significantly degrade recommendation quality, since it disregards the differential importance of each element and each aspect for different users. Such a configuration treats all users and their interactions uniformly, ignoring the personalized nuances of preference derived from historical behavior and social context, so the model would fail to recognize the distinct influence of social contexts on individual users' choices.
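This is essentially the AVG/MAX/ATT comparison reported earlier: the sketch below contrasts the three aggregation rules over the same set of element vectors, with the attentive scores treated as given; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
elements = rng.normal(size=(4, 16))       # e.g., embeddings within one contextual aspect
scores = rng.normal(size=4)               # attentive scores (assumed already computed)

avg_summary = elements.mean(axis=0)                         # AVG: equal weights
max_summary = elements[int(np.argmax(scores))]              # MAX: hard attention
att_weights = np.exp(scores - scores.max())
att_weights /= att_weights.sum()
att_summary = att_weights @ elements                        # ATT: soft attention

# AVG ignores which elements matter, MAX keeps only a single element,
# while ATT keeps every element but re-weights them per user.
print(np.round(att_weights, 3))
```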