
MTNet: A Neural Approach for Cross-Domain

Recommendation with Unstructured Text


Guangneng Hu, Yu Zhang, and Qiang Yang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
[email protected],[email protected],[email protected]
ABSTRACT
Collaborative filtering (CF) is the key technique for recommender systems (RSs). CF exploits user-item behavior interactions (e.g., clicks) only and suffers from the data sparsity issue. One solution is to integrate content information such as product reviews and news titles, leading to hybrid filtering methods. Another solution is to transfer knowledge from a related source domain, such as improving movie recommendation with knowledge of the book domain, leading to cross-domain methods where transfer learning is a key technique. In real life, no single service can satisfy all of a user's information needs. This motivates us to exploit information from both the content and across domains for RSs in this paper. We achieve this by developing approaches to capture the text content and to transfer cross-domain knowledge. We propose a novel neural model, MTNet ("M" for memory and "T" for transfer), for cross-domain recommendation with unstructured text in an end-to-end manner. MTNet can attentively extract useful content via a memory network (MNet) and can selectively transfer knowledge across domains via a transfer network (TNet), a novel network. The principle underlying these two components is the neural attention mechanism. A shared layer of feature interactions is stacked on top to couple the high-level representations learned from the individual networks. On two real-world datasets, MTNet shows better performance in terms of three ranking metrics compared with various baselines, including single/cross-domain, shallow/deep, and hybrid methods. We conduct thorough analyses to understand how the text content and the transferred knowledge help the proposed model.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
KDD'18 Deep Learning Day, August 2018, London, UK
© 2018 Copyright held by the owner/author(s). ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Recommender systems are widely used in various domains and e-commerce platforms, such as helping consumers buy products at Amazon, watch videos on YouTube, and read articles on Google News. Collaborative filtering (CF) is among the most effective approaches, based on the simple intuition that if users rated items similarly in the past then they are likely to rate items similarly in the future. Matrix factorization (MF) techniques, which can learn the latent factors of users and items, are its main cornerstone [16, 22]. Recently, neural networks like the multilayer perceptron (MLP) have been used to learn the interaction function from data [6, 10]. MF and neural CF suffer from the data sparsity and cold-start issues.
One solution is to integrate CF with content information, leading to hybrid methods. Items are usually associated with content information such as unstructured text, like news articles and product reviews. These additional sources of information can alleviate the sparsity issue and are essential for recommendation beyond user-item interaction data. For application domains like recommending research papers and news articles, the unstructured text associated with an item is its text content [1, 30]. In other domains like recommending products, the unstructured text associated with an item is its user reviews, which justify the rating behavior of consumers [20, 36]. Topic modelling and neural networks have been proposed to exploit the item content and lead to performance improvements. Memory networks are widely used in question answering and reading comprehension to perform reasoning [28]. The memories can naturally be used to model additional sources like the item content [14], or to model a user's neighborhood of users who consume common items with this user [7].
Another solution is to transfer knowledge from relevant domains, and cross-domain recommendation techniques address such problems [2, 17, 24]. In real life, a user typically participates in several systems to acquire different information services. For example, a user installs applications in an app store and reads news from a website at the same time. This brings us an opportunity to improve the recommendation performance in the target service (or all services) by learning across domains. Following the above example, we can represent the app installation feedback using a binary matrix whose entries indicate whether a user has installed an app. Similarly, we use another binary matrix to indicate whether a user has read a news article. Typically these two matrices are highly sparse, and it is beneficial to learn them simultaneously. This idea is sharpened into the collective matrix factorization (CMF) [27] approach, which jointly factorizes these two matrices by sharing the user latent factors. It combines CF on a target domain and another CF on an auxiliary domain, enabling knowledge transfer [23, 35]. In terms of neural networks, given two activation maps from two tasks, cross-stitch convolutional networks (CSN) [21] learn linear combinations of both input activations and feed these
combinations as input to the filters of the successive layers, hence enabling knowledge transfer between the two domains.
Thus it motivates us to exploit information from both the content and across domains for RSs in this paper. To capture the text content and to transfer cross-domain knowledge, we propose a novel neural model, MTNet, for cross-domain recommendation with unstructured text in an end-to-end manner. MTNet can attentively extract useful content via a memory network (MNet) and can selectively transfer knowledge across domains via a transfer network (TNet), a novel network. A shared layer of feature interactions is stacked on top to couple the high-level representations learned from the individual networks. On real-world datasets, MTNet shows better performance in terms of ranking metrics compared with various baselines. We conduct thorough analyses to understand how the content and the transferred knowledge help MTNet.
To the best of our knowledge, MTNet is the first deep model that transfers cross-domain knowledge for recommendation with unstructured text in an end-to-end learning manner. Our contributions are summarized as follows:
∙ The proposed MTNet exploits the text content and transfers from the source domain using an attention mechanism which is trained in an end-to-end manner. It is the first deep model that transfers cross-domain knowledge for recommendation with unstructured text using attention-based neural networks.
∙ The memory component (MNet) can attentively exploit the text content to match word semantics and user preferences. It is among a few recent works on adopting memory networks for hybrid recommendation.
∙ The transfer component (TNet) can selectively transfer source items with the guidance of target user-item interactions. It is a novel transfer network for cross-domain recommendation.
∙ The proposed model can alleviate the sparsity issue, including cold-user and cold-item start, and outperforms various baselines in terms of ranking metrics on two real-world datasets.
The paper is organized as follows. We firstly introduce the problem formulation in Section 2.1. Then we present the memory component (MNet) to exploit the text content in Section 2.2 and the transfer component (TNet) to transfer cross-domain knowledge in Section 2.3. We propose the neural model MTNet for cross-domain recommendation with unstructured text in Section 2.4, followed by its model learning (Section 2.5) and complexity analysis (Section 2.6). In Section 3, we experimentally demonstrate the superior performance of the proposed model over various baselines (Sec. 3.2). We show the benefit of both the transferred knowledge and the text content for the proposed model in Section 3.3. We also show that we can reduce the number of cold users and cold items that are difficult to predict accurately (Section 3.4) and hence alleviate the cold-start issue. We review related works in Section 4 and conclude the paper in Section 5.

2 THE PROPOSED MTNET MODEL
We describe the proposed MTNet model in this section. MTNet models user preferences in the target domain by exploiting the text content and transferring knowledge from a source/auxiliary domain. MTNet learns high-level representations for the unstructured text and the source domain items such that the learned representations can estimate the conditional probability of whether a user will like an item. This is done with a memory network (Sec. 2.2) and a novel transfer network (Sec. 2.3), coupled by shared embeddings on the bottom and an interaction layer on the top (Sec. 2.4). The entire network can be trained efficiently to minimize a binary cross-entropy loss by back-propagation (Sec. 2.5). We begin by describing the recommendation problem and the model formulation before introducing the network architecture.

2.1 Problem and Model Formulation
We have a target domain (e.g., news domain) user-item interaction matrix R_T ∈ R^{m×n_T} and a source domain (e.g., app domain) matrix R_S ∈ R^{m×n_S}, where m = |U| and n_T = |I_T| (n_S = |I_S|) is the number of users U and target items I_T (source items I_S). Note that the users are shared, and hence we can transfer knowledge across domains. We use u to index users, i to index target items, and j to index source items. The entry r_ui ∈ {0, 1} is one if user u interacted with item i and zero otherwise. Let [j]_u = (j_1, j_2, ..., j_s) be the s source items that user u has interacted with in the source domain.
The target domain also has content information (e.g., product reviews). Denote by d_ui the content text corresponding to user u and item i. It is a sequence of words d_ui = (w_1, w_2, ..., w_l), where each word w comes from a vocabulary V and l = |d_ui| is the length of the text document.
For the task of item recommendation, the goal is to generate a ranked list of items for each user based on her history records, i.e., top-N recommendation. We hope to improve the recommendation performance in the target domain with the help of both the content and the source domain information.
Denote by the vector r_u = (r_u1, r_u2, ..., r_{u n_T}) the interactions of user u. The proposed MTNet models the probability of each of his/her observations conditioned on this user, the content text, and the interacted source items:
    r̂_ui ≜ p(r_ui = 1 | u, d_ui, [j]_u).   (1)
This equation sharpens the intuition behind the MTNet model, that is, the conditional probability of whether user u will like item i can be determined by three factors: 1) his/her individual preferences, 2) the corresponding content text (d_ui), and 3) his/her behavior in a related source domain ([j]_u). The likelihood function of the entire matrix R_T is then defined as:
    p(R_T) = ∏_u ∏_i p(r_ui | u, d_ui, [j]_u).   (2)
The proposed MTNet is a neural network that learns the conditional probability in an end-to-end manner (see Fig. 1):
    r̂_ui = f(u, i, d_ui, [j]_u | Θ_f),   (3)
where f is the network function and Θ_f are the model parameters.
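As an illustration of this formulation (not part of the paper), the sketch below shows one way the inputs of Eq. (3) can be laid out for a toy user-item pair; all sizes and variable names are assumptions made for the example.

```python
import numpy as np

# Toy sizes: m users, n_T target items, n_S source items (illustrative values).
m, n_T, n_S = 4, 5, 6

# Implicit feedback matrices R_T and R_S with entries in {0, 1} (Sec. 2.1).
R_T = np.zeros((m, n_T), dtype=np.int8)
R_S = np.zeros((m, n_S), dtype=np.int8)
R_T[0, [1, 3]] = 1        # user 0 read news items 1 and 3
R_S[0, [0, 2, 5]] = 1     # user 0 installed apps 0, 2 and 5

# Unstructured text d_ui for a (user, item) pair: a sequence of word ids from vocabulary V.
d_ui = {(0, 1): [7, 12, 3], (0, 3): [5, 7]}

# [j]_u: the source items each user has interacted with, read off from R_S.
source_items = {u: np.flatnonzero(R_S[u]).tolist() for u in range(m)}

# The network f(u, i, d_ui, [j]_u | Theta_f) of Eq. (3) maps one such tuple to a score in (0, 1).
```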

[Figure 1: The Proposed MTNet Architecture. The memory network (MNet) is to model unstructured text. The transfer network (TNet) is to transfer cross-domain knowledge. The middle part is to learn the user-item interaction function. The shared layer is to learn feature interactions from the outputs of the individual networks.]

The model consists of a memory network o_ui = f_M(u, i, d_ui | Θ_M) to model the unstructured text (Sec. 2.2) and a transfer network c_ui = f_T(i, [j]_u | Θ_T) to transfer knowledge from the source domain (Sec. 2.3). A shared feature interaction layer f_S(o_ui, z_ui, c_ui | Θ_S), where z_ui is the non-linear representation of the (u, i) interaction, is stacked on top of the high-level representations learned by the individual networks.

2.2 MNet: Modeling Unstructured Text
We introduce the memory component, MNet, to exploit the content information, i.e., unstructured text. MNet is a variant of a memory augmented neural network which can learn high-level representations of unstructured text with respect to the given user-item interaction. The attention mechanism inherent in the memory component can determine which words are highly relevant to the user preferences.
The MNet consists of one internal memory matrix A ∈ R^{L×2d}, where L is the vocabulary size (typically L = 8000) and 2d is the dimension of each memory slot, and one external memory matrix C with the same dimensions as A. The two memory matrices work as follows.
Given a document d_ui = (w_1, w_2, ..., w_l) corresponding to the (u, i) interaction, we form the memory slots m_k ∈ R^{2d} by mapping each word w_k into an embedding vector with matrix A, where k = 1, ..., l and the length of the longest document is the memory size l_max. We form a preference vector q corresponding to the given document d_ui and the user-item interaction (u, i), where each element q_k encodes the relevance of user u to these words given item i as:
    q_k = x_u^T m_k^(u) + x_i^T m_k^(i),   (4)
where we split m_k = [m_k^(u), m_k^(i)] into the user part m_k^(u) and the item part m_k^(i). The x_u and x_i are the user and item embeddings obtained from embedding matrices P ∈ R^{m×d} and Q ∈ R^{n_T×d} respectively. On the right-hand side of the above equation, the first term captures the matching between the preferences of user u and word semantics; for example, if the user is a machine learning researcher, he/she may be more interested in words such as "optimization" and "Bayesian" than in words such as "history" and "philosophy". The second term computes the support of item i for the words; for example, if the item is a machine learning related article, it may support words such as "optimization" and "Bayesian" more than words such as "history" and "philosophy". Together, this content-based/associative addressing scheme can determine the internal memories that are highly relevant to the target user u regarding the words d_ui given the specific item i.
Actually, we can compact the above two terms into a single vector dot product by concatenating the embeddings of the user and the item into x_ui = [x_u, x_i]:
    q_k = x_ui^T m_k,  k = 1, 2, ..., l.   (5)
The neural attention mechanism can adaptively learn a weighting function over the words to focus on a subset of them. Traditional combinations of words predefine a heuristic weighting function such as averaging or tf-idf weights. Instead, we compute the attentive weights over words for a given user-item interaction to infer the importance of each word's unique contribution:
    p_k = Softmax(q_k) = exp(β q_k) / Σ_{k'=1}^{l} exp(β q_{k'}),   (6)
which produces a probability distribution over the words in d_ui. The neural attention mechanism allows the memory component to focus on specific words while placing little importance on other words which may be less relevant. The parameter β is introduced to stabilize the numerical computation when the exponentials of the softmax function are very large, and it can also amplify or attenuate the precision of the attention like a temperature [11], where a higher temperature (i.e., a smaller β) produces a softer probability distribution over words. We set β = d^{-1/2}, scaling with the dimensionality [29].
We construct the high-level representations by interpolating the external memories with the attentive weights as the output of the MNet:
    o_ui = Σ_k p_k c_k,   (7)
where the external memory slot c_k ∈ R^d is another embedding vector for word w_k, obtained by mapping it with matrix C. The external memories allow the storage of long-term knowledge pertaining specifically to each word's role in matching the user preference. In other words, the content-based addressing scheme identifies important words in a document acting as a key to retrieve the relevant values stored in the external
memory matrix C via the neural attention mechanism. The attention mechanism adaptively weights the words according to the specific user and item. The final output o_ui represents high-level, summarized information extracted attentively from the text content, involving the relations between the user-item interaction (u, i) and the corresponding words d_ui.
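To make Eqs. (4)-(7) concrete, here is a minimal NumPy sketch of the MNet forward pass under the notation above. It is an illustration with assumed toy sizes, not the authors' implementation; it follows the text's c_k ∈ R^d (Table 1 lists C as L × 2d instead).

```python
import numpy as np

m, n_T, d, L = 4, 5, 8, 50                  # toy sizes
rng = np.random.default_rng(0)
P = rng.normal(size=(m, d))                 # user embeddings
Q = rng.normal(size=(n_T, d))               # target-item embeddings
A = rng.normal(size=(L, 2 * d))             # internal memory matrix
C = rng.normal(size=(L, d))                 # external memory matrix

def mnet(u, i, word_ids, beta=d ** -0.5):
    """High-level text representation o_ui of Eqs. (4)-(7)."""
    x_ui = np.concatenate([P[u], Q[i]])     # x_ui = [x_u, x_i], length 2d
    m_k = A[word_ids]                       # internal memory slots, shape (l, 2d)
    q = m_k @ x_ui                          # Eq. (5): relevance scores q_k
    p = np.exp(beta * q)
    p /= p.sum()                            # Eq. (6): attention over the words
    c_k = C[word_ids]                       # external memory slots, shape (l, d)
    return p @ c_k                          # Eq. (7): o_ui = sum_k p_k c_k

o_ui = mnet(u=0, i=1, word_ids=[7, 12, 3])  # d-dimensional text representation
```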
Table 1: Model Parameters of MTNet.

Parameter        Dimension      Description
P                m × d          User embedding matrix
Q                n_T × d        Target item embedding matrix
H                n_S × d        Source item embedding matrix
A                L × 2d         Internal memory matrix
C                L × 2d         External memory matrix
W, b             2d × d, d      Linear mapping weight and bias for the user-item interaction
W_o, W_z, W_c    d × d          Linear mappings for the outputs of the individual networks
h                3d             Weight of the shared layer

2.3 TNet: Transferring Source Knowledge
We introduce the transfer component, TNet, to exploit the source domain knowledge. TNet is a novel network which can selectively transfer knowledge for cross-domain recommendation. The central idea is to learn adaptive weights over the source domain items, specific to the given target user-item interaction, during the knowledge transfer.
Given the source items [j]_u = (j_1, j_2, ..., j_s) with which the user u has interacted in the source domain, TNet learns a transfer vector c_ui ∈ R^d to capture the relations between the target item i and the source items given the user u. The underlying observation can be illustrated by an example of improving movie recommendation by transferring knowledge from the book domain. When we predict the preference of a user on the movie "The Lord of the Rings," the importance of books she has read such as "The Hobbit" and "The Silmarillion" may be much higher than that of books such as "Call Me by Your Name".
The similarities between the target item i and the source items can be computed by their dot products:
    a_j^(i) = x_i^T x_j,  j = 1, ..., s,   (8)
where x_j ∈ R^d is the embedding of source item j, given by an embedding matrix H ∈ R^{n_S×d}. This score computes the compatibility between the target item and the source items consumed by the user. For example, the similarity of the target movie i = "The Lord of the Rings" with the source book j = "The Hobbit" may be larger than that with the source book j' = "Call Me by Your Name" (given a user u).
We normalize the similarity scores into a probability distribution over the source items:
    α_j = Softmax(a_j^(i)),   (9)
and then the transfer vector is a weighted sum of the corresponding source item embeddings:
    c_ui = ReLU(Σ_j α_j x_j),   (10)
where we introduce non-linearity on the transfer vector via the rectified linear unit (ReLU) activation function. Empirically we found that the activation function ReLU(x) = max(0, x) works well due to its non-saturating nature and its suitability for sparse data.
The transfer vector c_ui is a high-level representation summarizing the knowledge from the source domain as the output of the TNet. TNet can selectively transfer representations from the corresponding embeddings of the source items with the guidance of the target user-item interaction.
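A corresponding NumPy sketch of the TNet forward pass (Eqs. (8)-(10)); again this is illustrative pseudocode with assumed toy sizes, not the authors' code.

```python
import numpy as np

n_T, n_S, d = 5, 6, 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(n_T, d))                # target-item embeddings
H = rng.normal(size=(n_S, d))                # source-item embeddings

def tnet(i, source_item_ids):
    """Transfer vector c_ui of Eqs. (8)-(10)."""
    x_j = H[source_item_ids]                 # embeddings of the user's source items, shape (s, d)
    a = x_j @ Q[i]                           # Eq. (8): dot-product similarities a_j
    alpha = np.exp(a)
    alpha /= alpha.sum()                     # Eq. (9): softmax over the source items
    return np.maximum(alpha @ x_j, 0.0)      # Eq. (10): ReLU of the attentive sum

c_ui = tnet(i=1, source_item_ids=[0, 2, 5])  # d-dimensional transfer vector
```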
Remark I. Memory networks were proposed to address the task of question answering, where the memories are a short story text and the query/input is a question related to the story, for which the answer can be reasoned by the network. We can think of recommendation with unstructured text as a question answering problem: the question to be addressed is how likely a user is to prefer an item, where the text content is the analogue of the story text and the query is the analogue of the given user-item interaction.
Remark II. The computational processes of MNet and TNet are similar. We firstly compute attentive weights over a collection of objects (words in MNet and items in TNet). Then we summarize the high-level representation as the output (the text representation o_ui in MNet and the transfer vector c_ui in TNet), weighted by the attentive probabilities which are computed by a content-based addressing scheme.

2.4 MTNet: The Proposed Neural Model
The architecture of the proposed MTNet model is illustrated in Figure 1 as a multi-layer feedforward neural network. The input layer specifies the embeddings of a user u, a target item i, and the corresponding source items [j]_u = (j_1, ..., j_s). The content text d_ui is modelled by the memories in the MNet to produce a high-level representation o_ui. The source items are transferred into the transfer vector c_ui with the guidance of (u, i) in the TNet. These computational paths were introduced in Sec. 2.2 and Sec. 2.3 respectively.
We now propose the MTNet model. Firstly, we use a single hidden layer network to learn a nonlinear representation for the user-item interaction:
    z_ui = ReLU(W x_ui + b),   (11)
where W and b are the weight and bias parameters of the hidden layer. Usually the dimension of z_ui is half of that of x_ui, following a typical tower-pattern architecture.
The outputs of the three individual networks can be viewed as high-level features of the content text, the source domain knowledge, and the user-item interaction. They come from different feature spaces learned by different networks. Thus, we use a shared layer on top of all the features:
    r̂_ui = 1 / (1 + exp(−h^T y_ui)),   (12)
where h is the parameter, and the joint representation:
    y_ui = [W_o o_ui, W_z z_ui, W_c c_ui],   (13)
is concatenated from the linearly mapped outputs of the individual networks, where the matrices W_o, W_z, W_c are the corresponding linear mapping transformations.
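The following sketch assembles Eqs. (11)-(13) on top of the MNet and TNet outputs sketched earlier; it is a minimal illustration under assumed dimensions (all output mappings d × d as in Table 1), not the reference implementation.

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
W = rng.normal(size=(d, 2 * d))              # hidden layer of the interaction path (2d -> d)
b = np.zeros(d)
W_o, W_z, W_c = (rng.normal(size=(d, d)) for _ in range(3))
h = rng.normal(size=3 * d)                   # weight of the shared layer

def predict(x_u, x_i, o_ui, c_ui):
    """Shared layer of Eqs. (11)-(13): interaction path, joint representation, sigmoid score."""
    x_ui = np.concatenate([x_u, x_i])
    z_ui = np.maximum(W @ x_ui + b, 0.0)     # Eq. (11)
    y_ui = np.concatenate([W_o @ o_ui, W_z @ z_ui, W_c @ c_ui])   # Eq. (13)
    return 1.0 / (1.0 + np.exp(-h @ y_ui))   # Eq. (12)
```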
2.5 Model Learning
Due to the nature of the implicit feedback and the task of item recommendation, the squared loss (r̂_ui − r_ui)^2 may not be suitable, since it is usually used for rating prediction. Instead, we adopt the binary cross-entropy loss:
    ℒ = − Σ_{(u,i)∈S} [ r_ui log r̂_ui + (1 − r_ui) log(1 − r̂_ui) ],   (14)
where the training samples S = R_T^+ ∪ R_T^− are the union of the observed target interactions and randomly sampled negative pairs. Usually |R_T^+| = |R_T^−|, and we do not perform a predefined negative sampling in advance, since that would only generate a fixed training set of negative samples. Instead, we generate negative samples during each epoch, enabling diverse and augmented training sets of negative examples to be used.
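A small sketch of this per-epoch negative sampling and the per-example loss of Eq. (14); the function names and the rejection-sampling loop are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sample_epoch(R_T, ratio=1, rng=np.random.default_rng(3)):
    """One epoch's training set S = R_T+ ∪ R_T-: all observed pairs plus freshly drawn
    negatives, re-sampled every epoch (Sec. 2.5)."""
    users, items = np.nonzero(R_T)
    samples = [(u, i, 1) for u, i in zip(users, items)]
    n_items = R_T.shape[1]
    for u in users:                          # |R_T-| = ratio * |R_T+|
        for _ in range(ratio):
            j = int(rng.integers(n_items))
            while R_T[u, j] == 1:            # reject observed interactions
                j = int(rng.integers(n_items))
            samples.append((u, j, 0))
    return samples

def bce(r, r_hat, eps=1e-8):
    """Binary cross-entropy of Eq. (14) for a single (u, i) pair."""
    return -(r * np.log(r_hat + eps) + (1 - r) * np.log(1 - r_hat + eps))
```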
This objective function has a probabilistic interpretation and is the negative log-likelihood of the following likelihood function:
    L(Θ | S) = ∏_{(u,i)∈R_T^+} r̂_ui · ∏_{(u,i)∈R_T^−} (1 − r̂_ui),   (15)
where the model parameters are:
    Θ = {P, Q, H, A, C, W, b, W_o, W_z, W_c, h}.   (16)
Compared with Eq. (2), instead of modeling all zero entries (i.e., the whole target matrix R_T), we learn from only a small subset of the unobserved entries and treat them as negative samples by picking them randomly during each optimization iteration (i.e., the negative sampling technique).
The objective function can be optimized by stochastic gradient descent (SGD) and its variants like the adaptive moment method (Adam) [15]. The update equation is:
    Θ_new ← Θ_old − η ∂L(Θ)/∂Θ,   (17)
where η is the learning rate. Typical deep learning libraries like TensorFlow (https://www.tensorflow.org) provide automatic differentiation, and hence we omit the gradient equations ∂L(Θ)/∂Θ, which can be computed by the chain rule in back-propagation (BP).

2.6 Complexity Analysis
Among the model parameters Θ, the embedding matrices P, Q and H contain a large number of parameters since they depend on the number of users and of (target and source) items, whose scale is hundreds of thousands. Typically, the number of words, i.e., the vocabulary size, is L = 8000. The dimension of the embeddings is typically d = 100. Since the architecture follows a tower pattern, the dimension of the outputs of the three individual networks is also limited to within hundreds. In total, the size of the model parameters is linear in the input size and is close to the size of typical latent factor models [27] and of one-hidden-layer neural CF approaches [10].
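To make the linearity claim concrete, here is a back-of-the-envelope parameter count using the Mobile sizes from Table 2 and the typical d and L quoted above; the arithmetic is ours and only illustrative.

```python
m, n_T, n_S = 15_890, 84_802, 14_340   # Mobile dataset sizes (Table 2)
d, L = 100, 8_000                      # typical dimensions quoted in Sec. 2.6

embeddings = (m + n_T + n_S) * d       # P, Q, H
memories = 2 * L * (2 * d)             # A and C, each L x 2d
dense = (2 * d * d + d) + 3 * d * d + 3 * d   # W, b, W_o, W_z, W_c, h
print(embeddings + memories + dense)   # ~14.8M, dominated by the embedding matrices
```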
During training, we compute the outputs of the three individual networks in parallel using mini-batch stochastic optimization, which can be trained efficiently by back-propagation. MTNet is scalable in the number of training examples. It can easily be updated when new data examples arrive, by simply feeding them into the training mini-batches. Thus, MTNet can handle the scalability and dynamics of items and users in an online fashion. In contrast, topic-modeling-based techniques have difficulty benefitting from these advantages to this extent.

Table 2: Datasets and Statistics.

Dataset       Domain            Statistics               Amount
Mobile        Shared            #Users                   15,890
              Target (News)     #News                    84,802
                                #Reads                   477,685
                                Density                  0.035%
                                #Words                   612,839
                                Avg. Words Per News      7.2
              Source (Apps)     #Apps                    14,340
                                #Installations           817,120
                                Density                  0.359%
Amazon Men    Shared            #Users                   8,514
              Target (Men)      #Clothes (Men)           28,262
                                #Ratings/#Reviews        56,050
                                Density                  0.023%
                                #Words                   1,845,387
                                Avg. Words Per Review    32.9
              Source (Sports)   #Products (Sports)       41,317
                                #Ratings/#Reviews        81,924
                                Density                  0.023%

3 EXPERIMENTS
In this section, we conduct an empirical study to answer the following questions: 1) how does the proposed MTNet model perform compared with state-of-the-art recommender systems; and 2) how do the text content and the source domain information contribute to the proposed framework. We firstly introduce the evaluation protocols and experimental settings, and then we compare the performance of different recommender systems. We further analyze the MTNet model to understand the impact of the memory and transfer components. We also investigate to what extent the improved performance comes from cold users and cold items.

3.1 Experimental Settings
Dataset We evaluate on two real-world cross-domain datasets. The first dataset, Mobile (an anonymous version can be released later), is provided by a large internet company, Cheetah Mobile (http://www.cmcm.com/en-us/). The information contains logs of users reading news, the history of app installations, and some metadata such as news publisher and user gender, collected in one month in the US. We removed users with fewer than 10 feedbacks. For each item, we use the news title as its text content. We filter stop words and use tf-idf to choose the top 8,000 distinct
words as the vocabulary. This yields a corpus of 612K words. The average number of words per news item is less than 10. The dataset we used contains 477K user-news reading records and 817K user-app installations. There are 15.8K shared users, which enables knowledge transfer between the two domains. We aim to improve the news recommendation by transferring knowledge from the app domain. The data sparsity is over 99.6%.
The second dataset is a public Amazon dataset (http://snap.stanford.edu/data/web-Amazon.html), which has been widely used to evaluate the performance of collaborative filtering approaches [9]. We use the two categories Amazon Men and Amazon Sports as the cross-domain pair. The original ratings are from 1 to 5, where five stars indicate that the user shows a positive preference for the item while one star does not. We treat ratings of 4-5 as positive samples. The dataset we used contains 56K positive ratings on Amazon Men and 81K positive ratings on Amazon Sports. There are 8.5K shared users, 28K Men products, and 41K Sports goods. We aim to improve the recommendation in the Men domain by transferring knowledge from the relevant Sports domain. The data sparsity is over 99.7%. We filter stop words and use tf-idf to choose the top 8,000 distinct words as the vocabulary. The average number of words per review is 32.9.
The statistics of the two datasets are summarized in Table 2. As we can see, both datasets are very sparse, and hence we hope to improve performance by transferring knowledge from the auxiliary domain and by exploiting the text content as well. Note that the Amazon dataset has long text in the form of product reviews (the average number of words per item is 32), while Cheetah Mobile has short text in the form of news titles (the average number of words per item is 7).
Evaluation Protocol For the item recommendation task, the leave-one-out (LOO) evaluation is widely used, and we follow the protocol in [10]. That is, we reserve one interaction as the test item for each user. We determine hyper-parameters by randomly sampling another interaction per user as the validation/development set. We follow the common strategy which randomly samples 99 (negative) items that are not interacted with by the user and then evaluates how well the recommender can rank the test item against these negative ones.
Since we aim at top-K item recommendation, the typical evaluation metrics are hit ratio (HR), normalized discounted cumulative gain (NDCG), and mean reciprocal rank (MRR), where the ranked list is cut off at topK = {5, 10, 20}. HR intuitively measures whether the reserved test item is present in the top-K list, defined as:
    HR = (1/|U|) Σ_{u∈U} δ(p_u ≤ topK),   (18)
where p_u is the hit position of the test item of user u, and δ(·) is the indicator function. NDCG and MRR also account for the rank of the hit position, and are respectively defined as:
    NDCG = (1/|U|) Σ_{u∈U} log 2 / log(p_u + 1),   (19)
    MRR = (1/|U|) Σ_{u∈U} 1/p_u.   (20)
A higher value with a lower cutoff indicates better performance.
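The sketch below computes the three metrics from per-user hit positions under the 99-negative leave-one-out protocol. It follows the common convention of zeroing a user's NDCG/MRR contribution when the test item falls outside the top-K cutoff, which Eqs. (19)-(20) leave implicit; treat it as an illustrative reading, not the authors' evaluation script.

```python
import numpy as np

def rank_metrics(hit_positions, topK=10):
    """HR, NDCG and MRR of Eqs. (18)-(20). hit_positions[u] is the rank p_u of the
    held-out test item among itself plus the 99 sampled negatives (1 = best)."""
    p = np.asarray(hit_positions, dtype=float)
    hit = p <= topK
    hr = hit.mean()                                              # Eq. (18)
    ndcg = np.where(hit, np.log(2) / np.log(p + 1), 0.0).mean()  # Eq. (19)
    mrr = np.where(hit, 1.0 / p, 0.0).mean()                     # Eq. (20)
    return hr, ndcg, mrr

print(rank_metrics([1, 3, 12, 55], topK=10))  # two of the four test items are hits
```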
Baselines We compare with various baselines, categorized as single/cross-domain, shallow/deep, and hybrid methods:

Baselines        Shallow method   Deep method
Single-Domain    BPRMF            MLP
Cross-Domain     CDCF, CMF        MLP++, CSN
Hybrid           HFT, TextBPR     LCMR
Cross + Hybrid   CDCF++           MTNet (ours)

∙ BPRMF, Bayesian personalized ranking [26], is a latent factor model based on matrix factorization and a pair-wise loss. It learns on the target domain only.
∙ CDCF, Cross-Domain CF with factorization machines (FM) [18], is a cross-domain recommender which extends FM [25]. It is a context-aware approach which applies factorization on the merged domains (aligned by the shared users). That is, the auxiliary domain is used as context. On the Mobile dataset, the context for a user in the target news domain is his/her history of app installations in the source app domain. The input feature vector is a sparse vector x ∈ R^{m+n_T+n_S} whose non-zero entries are as follows: 1) the index of the user id, 2) the index of the target news id (target domain), and all indices of his/her installed apps (source domain).
∙ CDCF++: We extend the above CDCF model to exploit the text content. The input feature vector is a sparse vector x ∈ R^{m+n_T+n_S+L} whose non-zero entries are augmented by the word features corresponding to the given user-item interaction. In this way, CDCF++ can learn from both the source domain and the unstructured text information.
∙ CMF, Collective matrix factorization [27], is a multi-relation learning approach which jointly factorizes the matrices of the individual domains. Here, the relation is the user-item interaction. On Mobile, the two matrices are A = "user by news" and B = "user by app" respectively. The shared user factors P enable knowledge transfer between the two domains. CMF factorizes the matrices A and B simultaneously by sharing the user latent factors: A ≈ P^T Q_A and B ≈ P^T Q_B. It is a shallow model which jointly learns on the two domains. This can be thought of as a non-deep transfer/multitask learning approach for cross-domain recommendation.
∙ HFT, Hidden Factors and hidden Topics [20], adopts topic distributions to learn latent factors from text reviews. It is a hybrid method.
∙ TextBPR extends the basic BPRMF model by integrating text content. It computes the prediction scores from two parts: one is the standard latent factors, the same as in BPRMF; the other is the text factors learned from the text content. It has two implementations, the VBPR model [9] and the TBPR model [12], which are the same in essence.
∙ MLP, multilayer perceptron [10], is a neural CF approach which learns the nonlinear interaction function using neural networks. It is a deep model learning on the target domain only.

Table 3: Comparison Results of Different Methods on the Mobile Dataset. The best baselines are marked with
asterisks and the best results are boldfaced.

             topK=5              topK=10             topK=20
Method       HR    NDCG  MRR     HR    NDCG  MRR     HR    NDCG  MRR
BPRMF .4380 .3971 .3606 .4941 .4182 .3694 .5398 .4316 .3730
CDCF .5066 .3734 .3293 .5325 .4089 .3441 .5452 .4374 .3519
CMF .4789 .3535 .3119 .5846 .3879 .3263 .6662 .4086 .3320
HFT .4966 .3617 .3175 .5580 .4093 .3365 .6547 .4379 .3445
TextBPR .4948 .4298 .3826 .5466 .4499 .3913 .6123 .4682 .3958
CDCF++ .4981 .3693 .3267 .6055 .4041 .3411 .6244 .4335 .3491
MLP .5380 .4121 .3702 .6176 .4381 .3810 .6793 .4529 .3851
MLP++ .5524 .4284 .3871 .6319 .4535 .3976 .6910 .4691 .4019
CSN .5551* .4323* .3920* .6327* .4574* .4025* .6908 .4732* .4068*
LCMR .5476 .4189 .3762 .6311 .4460 .3874 .6927* .4619 .3918
MTNet .5664 .4427 .4018 .6438 .4680 .4124 .6983 .4820 .4163
Improvement of MTNet 2.04% 2.42% 2.51% 1.75% 2.32% 2.47% 0.81% 1.86% 2.34%

∙ MLP++: We combine two MLPs by sharing the user embedding matrix, enabling knowledge transfer between the two domains through the shared users. It is a naive knowledge transfer approach applied to cross-domain recommendation.
∙ CSN, Cross-stitch network [21], is a deep multitask learning model originally proposed for visual recognition tasks. We use cross-stitch units to stitch two MLP networks. It learns a linear combination of the activation maps from the two networks, so that each benefits from the other (a minimal sketch of a cross-stitch unit follows this list). Compared with MLP++, CSN enables knowledge transfer also in the hidden layers, besides the lower embedding matrices. This is a deep transfer learning approach for cross-domain recommendation.
∙ LCMR, Local and Centralized Memory Recommender [14], is a deep model for collaborative filtering with unstructured text. Its local memory module is similar to our MNet, except that we only have one layer. This is a deep hybrid method.
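For reference, a cross-stitch unit can be sketched as below; the weights (alphas) shown are illustrative values, whereas in CSN they are learned jointly with the two networks [21].

```python
import numpy as np

def cross_stitch(h_target, h_source, alphas):
    """One cross-stitch unit: each network's next-layer input is a learned linear
    combination of both networks' activations at the current layer."""
    a_tt, a_ts, a_st, a_ss = alphas
    return a_tt * h_target + a_ts * h_source, a_st * h_target + a_ss * h_source

h_t, h_s = np.ones(4), np.full(4, 2.0)
print(cross_stitch(h_t, h_s, alphas=(0.9, 0.1, 0.2, 0.8)))
```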
Implementation For BPRMF, we use LightFM's implementation (https://github.com/lyst/lightfm), a popular CF library. For CDCF and CDCF++, we adapt the official libFM implementation (http://www.libfm.org). For CMF, we use a Python version written with reference to the original Matlab code (http://www.cs.cmu.edu/~ajit/cmf/). For HFT and TextBPR, we use the code released by their authors (http://cseweb.ucsd.edu/~jmcauley/). The word embeddings used in TextBPR are pre-trained with GloVe (https://nlp.stanford.edu/projects/glove/). For latent factor models, we vary the number of factors from 10 to 100 with step size 10. For MLP, we use the code released by its authors (https://github.com/hexiangnan/neural_collaborative_filtering). MLP++ and CSN are implemented based on MLP. The LCMR model is similar to our MNet model and is thus implemented together with ours. Our methods are implemented using TensorFlow. Parameters are randomly initialized from a Gaussian N(0, 0.01^2). The optimizer is Adam with an initial learning rate of 0.001. The mini-batch size is 128. The negative sampling ratio is 1. MLP and MLP++ follow a tower pattern, halving the layer size for each successive higher layer. Specifically, the configuration of the hidden layers in the base MLP network is [64 → 32 → 16 → 8], as in the original paper [10]. CSN requires the number of neurons in each hidden layer to be the same, and its configuration is [64] * 4 (equal to [64 → 64 → 64 → 64]); we investigate several typical configurations {16, 32, 64, 80} * 4. The dimension of the embeddings is d = 75.

3.2 Comparison Results
In this section, we report the recommendation performance of the different methods and discuss the findings. The comparison results are shown in Table 3 and Table 4 on the Mobile and Amazon datasets respectively, where the last row is the relative improvement of ours vs. the best baseline. We have the following observations.
Firstly, we can see that our proposed neural model is better than all baselines on the two datasets at each setting, including the base MLP network, shallow cross-domain models (CMF and CDCF), deep cross-domain models (MLP++ and CSN), and hybrid methods (HFT, TextBPR, and LCMR). These results demonstrate the effectiveness of the proposed neural model.
On the Mobile dataset, the differences between MTNet and the other methods are more pronounced for small numbers of recommended items such as top-5 or top-10, where we achieve on average 2.25% relative improvement over the best baseline. This is a desirable feature, since we often recommend only a small number of top-ranked items to consumers to alleviate the information overload issue.
Table 4: Comparison Results of Different Methods on the Amazon Dataset. The best baselines are marked
with asterisks and the best results are boldfaced.

             topK=5              topK=10             topK=20
Method       HR    NDCG  MRR     HR    NDCG  MRR     HR    NDCG  MRR
BPRMF .0810 .0583 .0509 .1204 .0710 .0561 .1821 .0864 .0602
CDCF .1295 .0920 .0797 .2070 .1167 .0897 .3841 .1609 .1015
CMF .1498 .0950 .0771 .2224 .1182 .0863 .3573 .1521 .0957
HFT .1077 .0815 .0729 .1360 .0907 .0767 .2782 .1252 .0854
TextBPR .1517 .1208 .1104 .1777 .1291 .1138 .2268 .1414 .1171
CDCF++ .1314 .0926 .0800 .2102 .1177 .0901 .3822 .1605 .1016
MLP .2100 .1486 .1283 .2836 .1697 .1371 .3820 .1899 .1426
MLP++ .2263 .1626 .1417 .2992 .1862 .1514 .3810 .2069 .1570
CSN .2340* .1680* .1462* .3018* .1898* .1552* .3944* .2091* .1605*
LCMR .2024 .1451 .1263 .2836 .1678 .1356 .3951 .1918 .1420
MTNet .2575 .1796 .1550 .3490 .2077 .1666 .4443 .2311 .1727
Improvement of MTNet 10.04% 6.90% 6.01% 15.63% 9.43% 7.34% 12.65% 10.52% 7.60%

Note that the relative improvement of the proposed model vs. the best baseline is more significant on the Amazon dataset than on the Mobile dataset, reaching an average 9.56% relative improvement over the best CSN baseline, even though Amazon is sparser than Mobile (see Table 2). One explanation is that the relatedness of the Men and Sports domains is closer than that between the news and app domains. This benefits all cross-domain methods, including CMF, CDCF, MLP++, and CSN, since they exploit information from both domains. Another explanation is that the text content contains richer information on the Amazon dataset. As shown in Table 2, the product reviews are on average longer than the news titles. This benefits all hybrid methods, including HFT, TextBPR, and LCMR.
The hybrid TextBPR model composes a document representation by averaging the words' embeddings. This cannot distinguish the important words to match user preferences, which may explain why it has difficulty improving the recommendation performance when integrating text content. For example, it cannot consistently outperform the pure CF method, MLP. The cross-domain CSN model transfers every representation from the source network with the same coefficient. This risks transferring noise and harming performance, as pointed out in its sparse variant [13]. On the Amazon dataset, it loses to the proposed model by a large margin (though MTNet additionally leverages content information). In contrast, the memory and transfer components are both selective in extracting useful information based on the attention mechanism. This may explain why our model is consistently the best in all settings.
There is a possibility that noise from the auxiliary domain and irrelevant information contained in the unstructured text pose a challenge for exploiting them. The results show that the proposed model is more effective, since it can select useful representations from the source network and attentively focus on the important words to match the preferences of users.
In summary, the empirical comparison results demonstrate the superiority of the proposed neural model in exploiting the text content and the source domain knowledge for recommendation.

[Figure 2: Contributions from Unstructured Text (MNet) and Cross-Domain Knowledge (TNet) on the Mobile (Top) and Amazon (Bottom) Datasets. Each panel reports HR, NDCG, and MRR at cutoffs 5 and 10 for MTNet∖M∖T, MTNet∖M, MTNet∖T, and the full MTNet.]

3.3 Impact of Unstructured Text and Auxiliary Domain
We have shown the effectiveness of the memory and transfer components together in the proposed framework. We now investigate the contribution of each network to MTNet by eliminating the impact of the text content and of the source domain from it in turn:
∙ MTNet∖M∖T: Eliminating the impact of both the content and the source information from MTNet. This is a
collaborative filtering recommender. Actually, it is equivalent to a single-hidden-layer MLP model.
∙ MTNet∖M: Eliminating the impact of the content information (MNet) from MTNet. This is a novel cross-domain recommender which can adaptively select source items to transfer.
∙ MTNet∖T: Eliminating the impact of the source information (TNet) from MTNet. This is a novel hybrid filtering recommender which can attentively select content words to exploit.
The ablation analyses of MTNet and its components are shown in Figure 2. The performance degrades when either the memory or the transfer module is eliminated. This is understandable, since we lose some information. In other words, the two components extract useful knowledge that improves the recommendation performance. For example, MTNet∖T and MTNet∖M respectively lose 1.1% and 4.3% relative NDCG@10 performance compared with MTNet on the Mobile dataset (8.5% and 16.1% on Amazon), suggesting that both the memory and transfer networks learn knowledge essential for recommendation. On the two evaluated datasets, removing the memory component degrades performance more than removing the transfer component. This may be because the text content contains richer information, or because the source domain contains much more noise, or both.

3.4 Improvement on Cold Users and Items
The cold-user and cold-item problems are common issues in recommender systems. When new users enter a system, they have no history that can be exploited by the recommender system to learn their preferences, leading to the cold-user start problem. Similarly, when the latest news is released on Google News, there are no reading records that can be exploited by the recommender system to learn users' preferences on it, leading to the cold-item start problem. In general, it is very hard to train a reliable recommender system and make predictions for users and items that have few interactions.
Intuitively, the proposed model can alleviate both the cold-user and the cold-item start issues. MTNet alleviates the cold-user start issue in the target domain by transferring his/her history from the related source domain. MTNet alleviates the cold-item start issue by exploiting the associated text content to reveal the item's properties, semantics, and topics. We now verify that MTNet indeed improves the performance on cold users and items by comparing with the pure neural collaborative filtering method, MLP.
We analyse the distribution of missed hit users (MHUs) of MTNet and MLP (at cutoff 10). We expect that the cold users in the MHUs of MLP can be reduced by using the MTNet model. The larger the reduction, the more effectively MTNet alleviates the cold-user start issue.
The results are shown in Figure 3, where the number of training examples measures the "coldness" of a user. Naturally, the MHUs are mostly cold users who have few training examples. As we can see, the number of cold users in the MHUs of MLP is higher than that of MTNet. If cold users are defined as those with fewer than seven training examples, then MTNet reduces the number of cold users from 4,218 to 3,746 on the Amazon dataset, a relative 12.1% reduction. On the Mobile dataset, if cold users are those with fewer than ten training examples (Mobile is denser than Amazon), then MTNet reduces the number of cold users from 1,385 to 1,145, a relative 20.9% reduction. These results show that the proposed model is effective in alleviating the cold-user start issue. The results on cold items are similar, and we omit them due to the page limit.

[Figure 3: The Missed Hit Users Distribution (not normalized) Over the Number of Training Examples on the Mobile (Left) and Amazon (Right) Datasets. Each panel plots the number of missed hit users of MTNet and MLP against the number of training examples.]

4 RELATED WORKS
Recommender systems aim at learning user preferences on unknown items from their past history. Content-based recommendation is based on the matching between user profiles and item descriptions. It is difficult to build a profile for each user when there is no or little content. Collaborative filtering (CF) alleviates this issue by predicting user preferences based on the user-item interaction behavior, agnostic to the content [5]. Latent factor models learn feature vectors for users and items mainly based on MF [16], which has probabilistic interpretations [22]. FM can mimic MF [25]. Neural networks have been proposed to push the learning of feature vectors towards non-linear representations, including NNMF and MLP [6, 10]. The basic MLP architecture has been extended to regularize the factors of users and items by social and geographical information [33]. Other neural approaches learn from explicit feedback for the rating prediction task [4, 36]. We focus on learning from implicit feedback for top-N recommendation [32]. CF models, however, suffer from the data sparsity issue.
Items are usually associated with content information such as unstructured text (e.g., abstracts of articles and reviews of products). CF approaches can be extended to exploit the content information [1, 30, 31] and user reviews [9, 12, 20]. Memory networks can reason with an external memory [28]. Due to their capability of naturally learning word embeddings to address the problems of word sparseness and the semantic gap, a memory module can be used to model item content [14] or
the neighborhood of users [7]. We follow this research thread by using neural networks to attentively extract important information from the text content.
Cross-domain recommendation [2] is an effective technique to alleviate the sparsity issue. A class of methods is based on MF applied to each domain, including CMF [27] with its heterogeneous variants [24], and codebook transfer [17]. Heterogeneous cross-domain [34], multiple source domain [19], and multi-view learning [8] approaches have also been proposed. Transfer learning (TL) aims at improving the performance of the target domain by exploiting knowledge from source domains [23]. Similar to TL, multitask learning (MTL) leverages useful knowledge in multiple related tasks to help each other [3, 35]. The cross-stitch network [21] enables information sharing between two base networks. We follow this research thread by using neural networks to selectively transfer knowledge from the source items.

5 CONCLUSION
We have shown that the text content and the source domain knowledge can help improve recommendation performance and can be integrated under a neural architecture. The sparse target user-item interaction matrix can be reconstructed with the guidance of both kinds of information, alleviating the data sparsity issue. We proposed a novel deep neural model, MTNet, for cross-domain recommendation with unstructured text. MTNet consists of a memory component which can attentively focus on important words to match user preferences and a transfer component which can selectively transfer useful source items to benefit the target domain. MTNet shows better performance than various baselines on two real-world datasets under different settings. Additionally, we conducted ablation analyses to understand the contributions of the memory and transfer components. We quantified the number of missed hit cold users (and items) that MTNet can reduce compared with the pure CF method, showing that MTNet is able to alleviate the cold-start issue.

REFERENCES
[1] T. Bansal, D. Belanger, and A. McCallum. 2016. Ask the GRU: Multi-task learning for deep text recommendations. In ACM RecSys.
[2] I. Cantador, I. Fernández-Tobías, S. Berkovsky, and P. Cremonesi. 2015. Cross-domain recommender systems. In Recommender Systems Handbook.
[3] R. Caruana. 1997. Multitask learning. Machine Learning (1997).
[4] R. Catherine and W. Cohen. 2017. TransNets: Learning to transform for recommendation. In ACM RecSys.
[5] M. Deshpande and G. Karypis. 2004. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems.
[6] G. Dziugaite and D. Roy. 2015. Neural network matrix factorization. arXiv:1511.06443.
[7] T. Ebesu, B. Shen, and Y. Fang. 2018. Collaborative memory network for recommendation systems. In ACM SIGIR.
[8] A. Elkahky, Y. Song, and X. He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In WWW.
[9] R. He and J. McAuley. 2016. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI.
[10] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. 2017. Neural collaborative filtering. In WWW.
[11] G. Hinton, O. Vinyals, and J. Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531.
[12] G. Hu and X. Dai. 2017. Integrating reviews into personalized ranking for cold start recommendation. In PAKDD.
[13] G. Hu, Y. Zhang, and Q. Yang. 2018. CoNet: Collaborative cross networks for cross-domain recommendation. arXiv:1804.06769.
[14] G. Hu, Y. Zhang, and Q. Yang. 2018. LCMR: Local and centralized memories for collaborative filtering with unstructured text. arXiv:1804.06201.
[15] D. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
[16] Y. Koren, R. Bell, and C. Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer (2009).
[17] B. Li, Q. Yang, and X. Xue. 2009. Can movies and books collaborate?: Cross-domain collaborative filtering for sparsity reduction. In IJCAI.
[18] B. Loni, Y. Shi, M. Larson, and A. Hanjalic. 2014. Cross-domain collaborative filtering with factorization machines. In ECIR.
[19] Z. Lu, E. Zhong, L. Zhao, E. Xiang, W. Pan, and Q. Yang. 2013. Selective transfer learning for cross domain recommendation. In SDM.
[20] J. McAuley and J. Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In ACM RecSys.
[21] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. 2016. Cross-stitch networks for multi-task learning. In IEEE CVPR.
[22] A. Mnih and R. Salakhutdinov. 2008. Probabilistic matrix factorization. In NIPS.
[23] S. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (2010).
[24] W. Pan, N. Liu, E. Xiang, and Q. Yang. 2011. Transfer learning to predict missing ratings via heterogeneous user feedbacks. In IJCAI.
[25] S. Rendle. 2012. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (2012).
[26] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI.
[27] A. Singh and G. Gordon. 2008. Relational learning via collective matrix factorization. In ACM SIGKDD.
[28] S. Sukhbaatar, J. Weston, and R. Fergus. 2015. End-to-end memory networks. In NIPS.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In NIPS.
[30] C. Wang and D. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In ACM SIGKDD.
[31] H. Wang, N. Wang, and D. Yeung. 2015. Collaborative deep learning for recommender systems. In ACM SIGKDD.
[32] Y. Wu, C. DuBois, A. Zheng, and M. Ester. 2016. Collaborative denoising auto-encoders for top-N recommender systems. In ACM WSDM.
[33] C. Yang, L. Bai, C. Zhang, Q. Yuan, and J. Han. 2017. Bridging collaborative filtering and semi-supervised learning: A neural approach for POI recommendation. In ACM SIGKDD.
[34] D. Yang, J. He, H. Qin, Y. Xiao, and W. Wang. 2015. A graph-based recommendation across heterogeneous domains. In ACM CIKM.
[35] Y. Zhang and Q. Yang. 2017. A survey on multi-task learning. arXiv:1707.08114.
[36] L. Zheng, V. Noroozi, and P. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In ACM WSDM.
