
Seamlessly Unifying Attributes and Items: Conversational

Recommendation for Cold-Start Users


SHIJUN LI∗, University of Science and Technology of China, China
WENQIANG LEI†∗ , National University of Singapore, Republic of Singapore
QINGYUN WU, University of Virginia, United States

XIANGNAN HE, University of Science and Technology of China, China


PENG JIANG, Kuaishou Inc., China
TAT-SENG CHUA, National University of Singapore, Republic of Singapore
Static recommendation methods like collaborative filtering suffer from the inherent limitation of performing
real-time personalization for cold-start users. Online recommendation, e.g., the multi-armed bandit approach,
addresses this limitation by interactively exploring user preferences online and pursuing the exploration-
exploitation (EE) trade-off. However, existing bandit-based methods model recommendation actions homoge-
neously. Specifically, they only consider items as the arms and are incapable of handling item attributes,
which naturally provide interpretable information about a user's current demands and can effectively filter out
undesired items. In this work, we consider conversational recommendation for cold-start users, where a
system can both ask about attributes and recommend items to a user interactively. This important scenario
was studied in a recent work [54]. However, it employs a hand-crafted function to decide when to ask about attributes
or make recommendations. Such separate modeling of attributes and items makes the effectiveness of the
system rely heavily on the choice of the hand-crafted function, thus introducing fragility to the system. To
address this limitation, we seamlessly unify attributes and items in the same arm space and achieve their EE
trade-offs automatically using the framework of Thompson Sampling. Our Conversational Thompson Sampling
(ConTS) model holistically solves all questions in conversational recommendation by choosing the arm with
the maximal reward to play. Extensive experiments on three benchmark datasets show that ConTS outperforms
the state-of-the-art methods Conversational UCB (ConUCB) [54] and the Estimation–Action–Reflection model [27]
on both metrics of success rate and average number of conversation turns.
CCS Concepts: • Computing methodologies → Artificial intelligence; Machine learning algorithms;
• Human-centered computing → Collaborative and social computing design and evaluation methods.
Additional Key Words and Phrases: Conversational Recommendation; Interactive Recommendation; Recom-
mender System; Dialogue System
∗ Both authors contributed equally to this research.
† Wenqiang Lei is the corresponding author.

Authors’ addresses: Shijun Li, [email protected], University of Science and Technology of China, 443 Huangshan
Road, Hefei, China, 230027; Wenqiang Lei, [email protected], National University of Singapore, 13 Computing
Drive, National University of Singapore, Singapore, Republic of Singapore; Qingyun Wu, [email protected], University of
Virginia, Computer Science, 85 Engineer's Way, Charlottesville, VA, United States, 22903; Xiangnan He, [email protected],
University of Science and Technology of China, 443 Huangshan Road, Hefei, China, 230027; Peng Jiang, [email protected],
Kuaishou Inc., 6 West Shangdi Road, Haidian District, Beijing, China; Tat-Seng Chua, [email protected], National University
of Singapore, 13 Computing Drive, National University of Singapore, Singapore, Republic of Singapore, 117417.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
1046-8188/2021/8-ART40 $15.00
https://doi.org/10.1145/3446427


ACM Reference Format:


Shijun Li, Wenqiang Lei, Qingyun Wu, Xiangnan He, Peng Jiang, and Tat-Seng Chua. 2021. Seamlessly Unifying
Attributes and Items: Conversational Recommendation for Cold-Start Users. ACM Transactions on Information
Systems 39, 4, Article 40 (August 2021), 29 pages. https://doi.org/10.1145/3446427

1 INTRODUCTION
Recommendation systems play an increasingly important role in the current era of information
explosion. In the past decades, most recommendation research has focused on modeling
historical user behaviours, proposing various methods from traditional collaborative filtering [20, 39]
and content- and context-aware filtering [7, 38] to the recently prevalent neural network methods [9, 19]
and graph-based methods [18, 48]. We call such methods static recommendation as they learn
static user preferences from past interaction history, which inevitably requires an adequate amount
of past behaviour history for each user. However, this dependence on historical data makes such
methods suffer in the cold-start scenario, where new users arrive with no historical data.
Cold-start users typically interact with the system only briefly but expect to receive high-quality
recommendations. Moreover, their preferences might change dynamically over time, making the
recommendation even harder. The recently emerging application and technology of conversational
recommendation systems (CRS) bring a promising solution to the cold-start problem. A CRS is
envisioned as an interactive system that can ask about the user's preference towards item attributes
before making recommendations. With preferred attributes known, the item pool can be
significantly reduced [44] and the candidate items can be better scored [27], leading to more desirable
recommendations. Owing to this great potential, conversational recommendation has become an
emerging topic that has recently attracted extensive research attention [8, 10, 11, 35, 42, 44, 52, 54, 55].
On the industry side, we have also witnessed an increasing trend of exploring the technology
of CRS in various scenarios. A classical scenario is the customer service chatbot in E-commerce,
which can help users find products of interest through dialogues. Figure 1(a) shows the interface of
the chatbot in Taobao1: when a user comes into the system, the bot can ask questions to elicit user
preferences, such as asking for the user's preferred attributes (e.g., brand and category); then the bot
makes recommendations according to the user's responses.
In addition to chatbots, many scenarios that interact with users to facilitate information
seeking can be abstracted as the CRS problem. Taking the Kuaishou app2 as an example, when the
user browses videos, the app pops up a box to ask about the user's favourite attributes of the videos (see
Figure 1(b)). This can be formalized as a conversational recommendation process [27, 43, 45, 49, 55],
which is characterized as a form-based conversational recommendation system in [22]. Popping
up an attribute box can be formalized as "asking what attributes a user likes" in conversational
recommendation, while displaying videos can be formalized as making recommendation actions.
Such a mechanism has been demonstrated to be effective and has hence been widely adopted in many applications
like YouTube3, Spotify4 and Steam5.
As an emerging topic, conversational recommendation has been explored under various settings
with different focuses [8, 10, 11, 35, 42, 44, 52, 54, 55]. Focusing on cold-start users, the first attempt
was made by [11], which identified the key of cold-start conversational recommendation as pursuing
the exploration-exploitation (EE) trade-off. It addresses the problem with the multi-armed bandit
approach [29], modeling items as the arms to achieve the EE trade-off. However, this method is
1 The largest Chinese E-commerce platform: https://consumerservice.taobao.com/
2 A famous Chinese video-sharing platform: https://www.kuaishou.com/
3 A famous online video-sharing platform: https://www.youtube.com/
4 An international media services provider: https://www.spotify.com/us/
5 The largest game platform: https://store.steampowered.com/


Fig. 1. The chatbot interface in Taobao (a) and the pop-up box of attribute asking in Kuaishou (b).

inefficient, as it solicits user preferences only at the level of items, while the number of items can
be millions or even billions in real-world applications. To address this issue, a recent work [54]
integrates attribute asking into the system. It models attributes and items as two different types
of arms, using a manually defined function to determine the timing of attribute asking (e.g., asking
an attribute every three turns). We argue that a simple heuristic function is insufficient to find the best
timing of attribute asking, which depends on many ad-hoc factors like the current turn number
and the user's feedback. The strategy of attribute asking is critical for the usability and effectiveness of
a CRS, which aims to identify the desired item(s) within the fewest turns so as to avoid bothering
the user with too many questions. However, it has not been seriously considered in conversational
recommendation with the EE trade-off.
In this work, we explore the central theme of conversational recommendation for cold-start
users. We highlight two desired properties for such a CRS: 1) deciding the strategy of asking
and recommending in an intelligent way, and 2) keeping the balance between exploiting known
preferences and exploring new interests to achieve successful recommendations. To this end, we
propose the Conversational Thompson Sampling (ConTS) method, modeling attribute asking and
item recommending in the unified framework of Thompson Sampling. The key idea is to model
attributes and items as undifferentiated arms in the same arm space, where the reward estimation for
each arm is jointly decided by the user's representation, candidate items, and candidate attributes
to capture the mutual promotion of attributes and items. In this way, we seamlessly unify
attributes and items in conversational recommendation, leading to two main advantages of ConTS:
(1) it addresses the conversation policy questions in CRS proposed in [27] – what items to recommend,
what attributes to ask, and whether to ask or recommend in a turn – as a single problem of arm
choosing, which is optimized for maximal reward; and (2) it inherits the sampling and updating
mechanism of contextual Thompson Sampling [1], being able to achieve the EE balance naturally.
The main contributions of this paper are as follows:
• We study the new task of conversational recommendation for cold-start users with attribute
preferences. By modeling attributes and items as indiscriminate arms in the same space, we
propose a holistic solution named ConTS to solve the three conversation policy questions in CRS
in an end-to-end manner. Meanwhile, we apply contextual Thompson Sampling to conversational
recommendation to keep an EE balance in the cold-start scenario.
• We conduct experiments on two existing datasets from Yelp and LastFM, and contribute a new
dataset for cold-start conversational recommendation evaluation from Kuaishou (video-click
records). Extensive experiments show that our ConTS outperforms the state-of-the-art CRS
methods Conversational UCB (ConUCB) [54] and Estimation–Action–Reflection (EAR) [27] on
both metrics of success rate and average turn for cold-start users. Further analysis shows the
importance of exploration and the effectiveness of ConTS's action policy. All code and data will
be released to facilitate the research community studying CRS6.

2 RELATED WORKS
In the field of recommendation systems, many static recommendation methods have achieved good
performance. For example, matrix factorization [25], factorization machines [38], NeuralFM [17]
and DeepFM [16] make use of historical user-item interaction records to estimate users' static
preferences based on the collaborative filtering hypothesis. Recently, some graph-based recommendation
models [46, 48] have attracted much attention for the ability of graphs to describe complex relations
between different agents. However, most of these works suffer from intrinsic problems for cold-start
users, as they need to be trained on a large offline dataset to make recommendations based on
historical logs.
This limitation motivates research efforts in online recommendation [30]. Among them, the
multi-armed bandit [3, 13, 29, 33, 47, 50, 51] is a prominent type of approach. Bandit methods aim
to maximize the accumulated reward for playing arms during a period of time by solving the
exploit-explore problem. The most popular bandit algorithms can be divided into two categories:
Thompson Sampling [1, 14, 15, 31, 33] and UCB [3, 26, 29]. Different from Thompson Sampling,
which samples from a changeable distribution, UCB maintains an upper confidence bound to keep the
EE balance. Some research works [6, 34, 40] have shown that Thompson Sampling performs
better than UCB both theoretically and empirically. However, bandit algorithms only work
on small arm pools, which usually requires a separate pre-processing of the candidate pool [6, 29].
This is because a large item pool makes it hard to efficiently hit the user-preferred item during
exploration.
Conversational recommendation systems (CRS) [8, 10, 11, 35, 42, 44, 52, 54, 55] provide a
promising solution. To enhance the performance of a recommendation system with a large item
pool, conversational recommendation asks questions to directly acquire the user's preferences. Such
information can help a CRS adjust its recommendation strategy effectively and enhance its
performance. Extensive research efforts have recently been devoted to exploring this topic across
various task formulations. Jannach et al. [22] classify CRSs into two categories: NLP-based and
form-based models. CRSs based on natural language [23, 24, 28, 37] are dynamic in terms of the
possible dialogue flow. Works in this direction focus on how to understand users' preferences and
intentions from their utterances in various forms and generate fluent responses, so as to deliver
natural and effective dialogues [23, 24, 28, 37]. In form-based CRSs [5, 12, 21, 32], users typically
follow pre-defined dialogue paths and interact with the application by filling forms through pre-
defined options. These models are based on forms (e.g., buttons, radio buttons) and structured text
for the output (e.g., in the form of a list of options). The Kuaishou app example given in Section 1
also belongs to this category: it requires the user to choose his favourite attributes listed in the
box, which are options pre-defined by the system. However, most of these models are designed for
6 https://github.com/xiwenchao/conTS-TOIS-2021


existing users instead of considering cold-start users with little historical data. A typical example is
EAR [27], which is trained on existing users' interaction records to estimate the preferences of
these users and thus fails to handle new users without historical information.
There are pioneering works studying conversational recommendation problems for cold-start
users. They basically leverage bandit methods to achieve the EE balance. Christakopoulou et al. [11]
systematically compare several bandit methods (i.e., Thompson Sampling, UCB, Greedy, Max
Item Trait and Min Item Trait) and find that Thompson Sampling achieves the best performance.
However, the resultant CRS is only able to recommend items to users and not able to ask questions
about attributes. We argue that it is more efficient to explore the item space with the information
of item attributes. In many real-world applications, the number of items can be very large;
e.g., there are over billions of products (videos) in Alibaba (YouTube). Soliciting user preferences
at the level of item attributes is far more efficient than at the level of items; thus, if we know
what attributes are preferred by the user now, we can safely filter out the items that do not contain
the preferred attributes and facilitate efficient product searching. For example, if a user specifies
that he wants to watch a science-fiction movie, we can exclude other categories of movies from
consideration.
A very recent work by Zhang et al. [54] further considers attribute asking in the framework
of UCB [3, 26, 29] for conversational recommendation, which helps to keep the EE balance when
recommending for cold-start users. It models the attributes and items separately as two sets of arms
in different arm spaces and decides whether to ask or recommend in a totally handcrafted way.
Specifically, the authors define a frequency function 𝑏(𝑡) = 5 · ⌊log(𝑡)⌋, where 𝑡 denotes the current
conversation turn. ConUCB asks attributes only if 𝑏(𝑡) − 𝑏(𝑡 − 1) > 0. However, we argue that such
a handcrafted way is not robust. In contrast, we model attributes and items as undifferentiated arms
in the same arm space, and nicely fit the arm choosing into the framework of contextual Thompson
Sampling. As such, all three conversation policy questions (i.e., what item to recommend, what
attribute to ask, and whether to ask an attribute or to recommend) are seamlessly modeled as the
single problem of arm choosing. This results in a holistic model that is more robust and efficient
than ConUCB.

3 PRELIMINARY
This paper studies conversational recommendation for cold-start users. We choose the multi-round
conversational recommendation (MCR) scenario [27] as the problem setting because it is more
reflective of real-world scenarios [27]. It is worth mentioning that MCR is a type of form-based [22]
setting, which means users interact with the application by choosing from pre-defined options
instead of dynamically generated natural language. In this section, we first introduce the problem
setting, and then sketch Thompson Sampling since our method is inspired by it. For ease of
understanding, the complex mathematics is skipped. We refer the reader to [1, 14] for more details.
We give an intuitive illustration of how Thompson Sampling works in Section 7 (Supplementary).

3.1 Problem Setting


In MCR, the system can perform attribute asking and item recommending multiple times until
the recommendation is accepted or a maximum number of turns is reached. The objective is to
achieve successful recommendations with the fewest conversation turns for each cold-start user
𝑢 who has no past interaction records. Let V denote all items and P denote all
item attributes. Each item 𝑣 is described by a set of categorical attributes P𝑣 ; e.g., the attributes
of restaurants may include location, category, price level, and so on. In each turn 𝑡 (𝑡 = 1, 2, ...,𝑇 ,
where 𝑇 is the maximum number of turns) in a session, the system maintains two candidate pools,


Table 1. A summary of the main notations used in the paper

Notation  Implication
𝑢, U  User and the collection of all users
𝑣, V  Item and the collection of all items
𝑝, P  Attribute and the collection of all attributes
u, v, p  Embeddings of user, item and attribute
𝑎, x𝑎  Arm and the embedding of arm 𝑎 in bandit algorithms
𝑟𝑎  Reward received from the user on arm 𝑎
A  Collection of all arms
B𝑢 , f𝑢 , 𝝁𝑢 , 𝑙  Parameters for user 𝑢 in contextual Thompson Sampling
P𝑢  Currently known preferred attributes of user 𝑢
P𝑣  Attributes of item 𝑣
𝑑  Dimension of the embeddings of users, items and attributes
𝑘  The number of recommended items in one turn
𝑇  The maximum number of turns of a conversation in a CRS
V𝑘  Top 𝑘 items in the candidate item list
u𝑖𝑛𝑖𝑡  Initialization of the user embedding
u𝑜𝑟𝑖𝑔  User embedding in the original contextual Thompson Sampling
𝑟′𝑎  The de-biased reward in ConTS
𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 , 𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 , 𝐼𝑣𝑎𝑙𝑖𝑑𝑎𝑡𝑖𝑜𝑛  The sets for training, testing and validation

one for items V𝑡 ⊆ V and another for attributes P𝑡 ⊆ P, from which we select attribute(s) to ask
and top-𝑘 items to recommend, respectively. Note that if the conversation exceeds the maximum
turn 𝑇 , we assume that the user will quit due to limited patience. In the initial
turn (𝑡 = 1), the candidate pools contain all items and attributes, and they are gradually reduced
as the conversation proceeds.
With V𝑡 and P𝑡 established at the beginning of turn 𝑡, the system takes an action to interact with
the user — either asking for attribute(s) or recommending 𝑘 items. The user then gives feedback
to the system, which can be 1) liking the attribute(s), 2) disliking the attribute(s)7 , 3) accepting the
recommendations, or 4) rejecting the recommendations. Only if the user accepts the recommendations
does the session end successfully; in the other cases, the session continues and the system needs to adjust
the candidate pools and, possibly, the action policy. In terms of the policy for making actions,
there are three questions to consider: (1) what attributes to ask, (2) what items to recommend, and
(3) whether to ask or recommend in a turn. Instead of addressing each question with a separate
component [27, 44], our method seamlessly addresses them in a holistic framework (see the sketch below).
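To make the interaction protocol concrete, the following is a minimal sketch of one MCR session loop in Python. The `system` and `user` interfaces are hypothetical placeholders for the components defined later, not part of the original formulation:

```python
def run_mcr_session(system, user, T=15):
    """Skeleton of one multi-round conversational recommendation (MCR) session."""
    for t in range(1, T + 1):
        action, payload = system.act()            # ask attribute(s) or recommend k items
        feedback = user.respond(action, payload)  # one of the four feedback types
        if action == "recommend" and feedback:    # acceptance ends the session
            return True, t
        # Otherwise shrink the candidate pools and possibly adjust the policy.
        system.update(action, payload, feedback)
    return False, T  # the user quits after the maximum number of turns
```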

3.2 Thompson Sampling


As our method is inspired by Thompson Sampling, we briefly introduce it here. Thompson Sampling [1, 14, 15, 31, 33] is a typical class of bandit algorithms [41] which balance exploration
and exploitation during online learning. Here, exploitation means taking actions according
to the model's current estimation of the user's interest, and exploration means exploring a user's
unknown preferences to pursue a possibly higher reward in the future. Different from bandit
algorithms that use the upper confidence bound (UCB) [4, 29] to model the uncertainty of the
learner's estimation of a user's preference, Thompson Sampling balances EE by sampling from a
7 Being more comprehensive than [27], here we explicitly model the user's negative feedback on attributes.


posterior distribution of the reward estimation, which has been shown to be a more effective and robust
method in many situations according to previous research works [6, 40].
In the typical setting of the contextual bandit problem, there is an arm pool A and a set of contexts
X. In an interaction turn, the model needs to select an arm 𝑎 ∈ A to play given a context x ∈ X,
and then gets a reward 𝑟𝑎 from the user accordingly. The goal of arm choosing is to maximize
the accumulated reward within the limited interaction turns. In the interactive recommendation
scenario, an arm 𝑎 can represent an item and the context x can represent the accessible information
for the recommendation decision-making, such as the arm (i.e., item) embeddings. In a turn, the
recommender chooses an item to display to the user (i.e., plays an arm) and gets the user's feedback
as the reward, either acceptance (positive reward) or rejection (negative reward). In the linear
contextual bandit setting, the reward 𝑟𝑎 for an arm 𝑎 under context x𝑎 is assumed to be generated
from an (unknown) distribution with mean uᵀx𝑎 , where u ∈ ℝ𝑑 is a fixed but unknown parameter.
In the recommendation scenario, this unknown variable u can be considered as an embedding that
reflects a user's preference.
In Thompson Sampling, with a prior distribution 𝑃(u) on the unknown parameter u and a collection
of observed triples D𝑡 = {(x₁, 𝑎₁, 𝑟𝑎(1)), (x₂, 𝑎₂, 𝑟𝑎(2)), ..., (x𝑡₋₁, 𝑎𝑡₋₁, 𝑟𝑎(𝑡−1))} in the current
session at time 𝑡, a posterior estimation 𝑃(u|D𝑡) of the user's preference can be derived according to
Bayes' rule:
𝑃(u|D𝑡) ∝ 𝑃(D𝑡 |u)𝑃(u),    (1)
where 𝑃(D𝑡 |u) is the likelihood function.
Thompson Sampling samples ũ from 𝑃(u|D𝑡) to get instantiated parameters for the turn, based on
which the reward for each arm 𝑎 is calculated as 𝑟̃𝑎 = ũᵀx𝑎 . An arm is chosen to play according to
the reward calculated from the sampled parameter: arg max𝑎 ũᵀx𝑎 . After getting the user's feedback,
a new triple is added into D𝑡 . In this process, sampling and updating are the key mechanisms
that ensure the EE balance. We next use an example to illustrate them.
A commonly used likelihood function 𝑃(D𝑡 |u) and prior 𝑃(u) are the Gaussian likelihood
and the Gaussian prior [1], respectively. Contextual Thompson Sampling [1] naturally assumes that the
embedding of user 𝑢 also follows a Gaussian distribution, which can be written as N(𝝁𝑢 , 𝑙²B𝑢⁻¹).
The mean 𝝁𝑢 denotes the estimated expectation of the user embedding u, the covariance matrix
𝑙²B𝑢⁻¹ denotes the uncertainty about the embedding (note that B𝑢 is the inverse covariance matrix),
and 𝑙 is a hyper-parameter to control the uncertainty. After the user gives feedback (reward) 𝑟𝑎
on arm 𝑎, the model updates the parameters as:
B𝑢 = B𝑢 + x𝑎(𝑡) x𝑎(𝑡)ᵀ    (2)
f𝑢 = f𝑢 + 𝑟𝑎 · x𝑎(𝑡)    (3)
𝝁𝑢 = B𝑢⁻¹ f𝑢 ,    (4)
where x𝑎(𝑡) is the embedding of the chosen arm 𝑎(𝑡) at turn 𝑡, B𝑢 is initialized as the identity
matrix and f𝑢 as a zero vector. Contextual Thompson Sampling then samples from the posterior
distribution N(𝝁𝑢 , 𝑙²B𝑢⁻¹) to get an estimation of the user embedding u under uncertainty.
It keeps updating N(𝝁𝑢 , 𝑙²B𝑢⁻¹) during the interaction process. On the one hand, this posterior
update incorporates past experience into the reward estimation; on the other hand, it reflects the
model's uncertainty on arms with different contexts. For example, if the user gives negative
feedback on an arm 𝑎, contextual Thompson Sampling tends to give lower scores to 𝑎 and other arms
with embeddings similar to 𝑎 by updating the mean 𝝁𝑢 . And whenever the user gives feedback on
an arm, contextual Thompson Sampling updates B𝑢 to discourage further exploration of this
arm because the user's preference on it is already known. More details about the update strategy
are given in Section 7.
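To make the sampling-and-updating cycle concrete, below is a minimal sketch of one turn of contextual Thompson Sampling with the Gaussian prior and the updates of Eqs. (2)–(4). The function name and the `observe_reward` callback are our own illustrative choices, not part of [1]:

```python
import numpy as np

def contextual_ts_turn(mu_u, B_u, f_u, arm_embeddings, observe_reward, l=0.1,
                       rng=np.random.default_rng()):
    """One turn of contextual Thompson Sampling with a Gaussian posterior.

    mu_u: (d,) posterior mean of the user embedding
    B_u:  (d, d) inverse-covariance matrix, initialized as the identity
    f_u:  (d,) accumulated reward-weighted arm embeddings, initialized as zeros
    arm_embeddings: (n_arms, d) context vectors x_a of all candidate arms
    observe_reward: callback returning the user's feedback r_a for an arm index
    """
    # Sample an instantiated user embedding u~ from N(mu_u, l^2 * B_u^{-1}).
    cov = (l ** 2) * np.linalg.inv(B_u)
    u_tilde = rng.multivariate_normal(mu_u, cov)

    # Play the arm with the largest estimated reward u~^T x_a.
    a = int(np.argmax(arm_embeddings @ u_tilde))
    x_a = arm_embeddings[a]
    r_a = observe_reward(a)

    # Posterior update, Eqs. (2)-(4).
    B_u = B_u + np.outer(x_a, x_a)    # Eq. (2): shrink uncertainty along x_a
    f_u = f_u + r_a * x_a             # Eq. (3): accumulate observed reward
    mu_u = np.linalg.solve(B_u, f_u)  # Eq. (4): mu_u = B_u^{-1} f_u
    return a, mu_u, B_u, f_u
```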


4 METHOD
As discussed in Section 1, the key challenges in conversational recommendation for the cold-start
scenario lie in two aspects: (1) the strategy to decide whether to ask or recommend; and (2) the
balance of EE. To tackle these two challenges, we propose to unify attributes and items seamlessly
into one arm space, resulting in our ConTS model. Specifically, it comprises three parts, namely
1) Initialization and Sampling, 2) Arm Choosing, and 3) Updating. We detail them after a model
overview in this section.

4.1 Overview
Figure 2 shows the workflow of ConTS in a conversation session. We overview the model by
going through the major components (the numbers in the text description correspond to the numbers
in Figure 2).
Initialization and Sampling: If it is the first turn in the session, we initialize the distribution
of the user embedding by averaging the pre-trained embeddings of existing users (steps (1)~(2)).
Then, the conversation cycle starts by sampling an instantiated embedding for the user from
the currently estimated distribution (step (3)). The sampling step exactly follows contextual
Thompson Sampling [1], leveraging its advantage in achieving the EE balance.
Arm Choosing: ConTS decides on an action (asking or recommending) by choosing an arm with the
goal of maximizing the reward (steps (4)~(5)). Here, we make a key contribution by unifying the attributes
𝑝 ∈ P and items 𝑣 ∈ V as undifferentiated arms in the same space (the whole arm pool is
A = P ∪ V). Our ConTS only needs to choose arms based on a unified reward estimation function
which captures the mutual promotion of attributes and items. As such, the conversation policy
questions – what items to recommend, what attributes to ask, and whether to ask or recommend in
a turn – are reduced to a single question of arm choosing in ConTS. If the chosen arm is an item, the
CRS recommends the top 𝑘 items; if the arm is an attribute, the CRS asks about the attribute (or several
top attributes, depending on the problem setting, c.f. Section 5.1) (step (6)).
We argue this is the key difference between our ConTS and ConUCB [54]. In ConUCB, items
and attributes are modelled separately in two different arm spaces. The problems of what item
to recommend and what attribute to ask are modeled independently, where each requires specific
efforts, and the question of whether to ask or recommend in a turn is addressed by an ad-hoc rule
(e.g., asking attributes every five turns).
Updating: After playing an arm (recommending items or asking an attribute), the system receives feedback
from the user (step (7)). It first updates the current arm pool according to the feedback (e.g.,
removing items not containing the user's preferred attributes, see Section 4.4). ConTS then updates
the distribution of the user embedding based on our unified reward estimation function and the user's
feedback (step (8)). We detail this in Section 4.4.

4.2 Initialization and Sampling


Initialization with offline FM. As the standard step in contextual Thompson Sampling, we
need to obtain the arm embeddings. We train an offline Factorization Machine (FM) model following
EAR [27] on the records of existing users to simultaneously obtain the embeddings of all attributes, items
and existing users. Specifically, we train the FM with Bayesian Personalized Ranking (BPR) [39],
aiming to make it rank the user's preferred items and attributes higher than the others. We do multi-task
training, jointly training on the two tasks of item prediction and attribute prediction. Specifically,
we first train the model for item prediction. After it converges, we continue to optimize the FM for
attribute prediction. The embeddings of all attributes and items are in the same embedding space,
and they will be treated as arm embeddings later in Section 4.3.


Fig. 2. An example of ConTS' actions during a whole conversational recommendation process for cold-start
users. Orange blocks represent operations, green blocks represent parameters and blue blocks represent agents
in the scenario.

Online posterior sampling. ConTS follows the contextual Thompson Sampling described in
Section 3.2, assuming the embedding of a new user 𝑢 follows a multidimensional Gaussian
distribution N(𝝁𝑢 , 𝑙²B𝑢⁻¹). Conventionally, B𝑢 and 𝝁𝑢 are initialized with the identity matrix and the zero
vector, respectively. However, we hypothesize that the historical actions of existing users can help to
estimate the preferences of new users. Without any prior information, it is reasonable to assume
the preference of a new user to be the average of existing users'. Thus, we initialize 𝝁𝑢 as the
average embedding of existing users while following the convention to initialize B𝑢 as the identity
matrix. Specifically, if U𝑜𝑙𝑑 denotes the collection of all embeddings of existing users, then
u𝑖𝑛𝑖𝑡 = (1/𝑁) Σᵢ₌₁ᴺ uᵢ ,  uᵢ ∈ U𝑜𝑙𝑑 .    (5)
Correspondingly, the intermediate variable f𝑢 is also initialized with u𝑖𝑛𝑖𝑡 , since 𝝁𝑢 is computed
by Eq. (14) (i.e., 𝝁𝑢 = B𝑢⁻¹f𝑢 , where B𝑢 is initialized as the identity matrix).
After initialization, ConTS starts an MCR session to interact with the user. At the beginning of
each turn 𝑡 = 1, 2, ...,𝑇 in an MCR session, ConTS samples from N(𝝁𝑢 , 𝑙²B𝑢⁻¹) to get the user embedding
ũ8 . Sampling is the key step to achieve the EE balance. On one hand, ConTS uses the mean 𝝁𝑢 to
control the expectation of the sampling result to exploit the user's currently known preferences. On
the other hand, ConTS uses the covariance 𝑙²B𝑢⁻¹ to model the uncertainty about the estimated user
preference, which decides how ConTS explores the user's latent unknown preferences.
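A minimal sketch of this initialization and the per-turn sampling, assuming `existing_user_embeddings` holds the pre-trained FM embeddings of all existing users; the function names are ours:

```python
import numpy as np

def init_new_user(existing_user_embeddings):
    """Initialize a new user's posterior N(mu_u, l^2 B_u^{-1}) following Eq. (5)."""
    u_init = existing_user_embeddings.mean(axis=0)  # average of existing users
    d = u_init.shape[0]
    B_u = np.eye(d)      # convention: identity inverse-covariance matrix
    f_u = u_init.copy()  # so that mu_u = B_u^{-1} f_u = u_init at the start
    mu_u = u_init.copy()
    return u_init, mu_u, B_u, f_u

def sample_user_embedding(mu_u, B_u, l=0.01, rng=np.random.default_rng()):
    """Draw u~ from N(mu_u, l^2 B_u^{-1}) at the beginning of each turn."""
    return rng.multivariate_normal(mu_u, (l ** 2) * np.linalg.inv(B_u))
```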

4.3 Arm Choosing


After obtaining the user embedding by posterior sampling, the agent needs to take an action, either
asking an attribute or recommending items. This is much more complex than in existing contextual
8 Note that both 𝝁𝑢 and 𝑙 2 B𝑢−1 are changed in each turn according to user feedback (see Section 4.4).


Algorithm 1 Conversational Thompson Sampling (ConTS)

Input: user 𝑢, the set of all attributes P, the set of all items V, the number of items to recommend 𝑘, maximum turn number 𝑇 , hyper-parameter 𝑙;
Output: Recommendation result: success or fail;
1: Initialization: for a new user 𝑢, do the following initialization:
2:    A𝑢 = P ∪ V; u𝑖𝑛𝑖𝑡 = (1/𝑁) Σᵢ₌₁ᴺ [uᵢ | uᵢ ∈ U𝑜𝑙𝑑 ]; 𝝁𝑢 = f𝑢 = u𝑖𝑛𝑖𝑡 ; B𝑢 = I𝑑 ; P𝑢 = ∅;
3: for conversation turn 𝑡 = 1, 2, 3, ...,𝑇 do
4:    Sample ũ from the distribution ũ ∼ N(𝝁𝑢 , 𝑙²B𝑢⁻¹)
5:    Play arm 𝑎(𝑡) = arg max_{𝑎∈A𝑢} ũᵀx𝑎 + Σ_{𝑝𝑖∈P𝑢} x𝑎ᵀp𝑖
6:    if 𝑎(𝑡) ∈ P then
7:       ask 𝑢's preference on 𝑎(𝑡)
8:       get 𝑢's feedback 𝑟𝑎
9:       if 𝑢 accepts 𝑎(𝑡) then
10:         P𝑢 = P𝑢 ∪ {𝑎(𝑡)}
11:         A𝑢 = {𝑎 | 𝑎 ∈ (P \ P𝑢) or (𝑎 ∈ V & 𝑎(𝑡) ∈ P𝑎)}
12:      else: A𝑢 = A𝑢 \ {𝑎(𝑡)}
13:   else (𝑎(𝑡) ∈ V)
14:      recommend the top-𝑘 items V𝑘 from the ranked item pool V𝑟𝑎𝑛𝑘
15:      get 𝑢's feedback 𝑟
16:      if 𝑢 accepts V𝑘 then
17:         Recommendation succeeds, quit;
18:      else (𝑢 rejects V𝑘)
19:         Update: A𝑢 = A𝑢 \ V𝑘
20:   Update bandit parameters in sequence by:
21:      B𝑢 = B𝑢 + x𝑎(𝑡) x𝑎(𝑡)ᵀ
22:      𝑟′𝑎 = 𝑟𝑎 − x𝑎(𝑡)ᵀ(u𝑖𝑛𝑖𝑡 + Σ_{𝑝𝑖∈P𝑢} p𝑖)
23:      f𝑢 = f𝑢 + 𝑟′𝑎 · x𝑎(𝑡)
24:      𝝁𝑢 = B𝑢⁻¹ f𝑢
25:   if 𝑡 = 𝑇 then
26:      Recommendation fails, quit;

Thompson Sampling methods, where the action is simply to choose an item to recommend. In the
MCR scenario, a CRS needs to consider more questions: 1) what attributes to ask, 2) what items to
recommend, and 3) whether to ask or recommend in a turn. To address these problems, ConTS
adopts a simple but efficient strategy that models all items and attributes as undifferentiated arms
in the same arm space. As such, the aforementioned three problems in the MCR scenario are reduced
to the problem of arm choosing, which can be nicely solved by the maximum-reward approach in
contextual Thompson Sampling. Specifically, we assume the expected reward of arm 𝑎 for user 𝑢 is
generated by:

𝑟(𝑎, 𝑢, P𝑢) ∼ N(uᵀx𝑎 + Σ_{𝑝𝑖∈P𝑢} x𝑎ᵀp𝑖 , 𝑙²),    (6)

where 𝑙 is the standard deviation of the Gaussian noise in the reward observation, u and x𝑎
represent the embeddings of the user and the arm, and P𝑢 denotes the user's currently known preferred
attributes obtained from the conversation by asking attributes. It explicitly takes the attributes
confirmed by the user during a conversation as a strong indicator of user preference. The first

part uᵀx𝑎 models the general preference of user 𝑢 for arm 𝑎, and the second part Σ_{𝑝𝑖∈P𝑢} x𝑎ᵀp𝑖
models the affinity between arm 𝑎 and the user's preferred attributes P𝑢 . This formula differs
from the linear reward generation assumption in contextual Thompson Sampling in the sense that
it explicitly captures the mutual promotion of items and attributes. We intentionally design this
formula to exactly follow EAR [27]. This is because we want to establish a fair comparison between
our ConTS and EAR to demonstrate the effectiveness of the EE balance of the Thompson Sampling
framework in conversational recommendation for cold-start users (see Section 5).
Based on Eq. (6), the system chooses the arm with the maximal reward to play. That is to
say, if the chosen arm represents an attribute, the model asks about the user's preference on that
attribute (or several top attributes). If the chosen arm is an item, the model directly
recommends the top 𝑘 items with the maximal rewards to the user. This arm-choosing policy seamlessly
unifies asking attributes and recommending items in a holistic framework, while
ConUCB [54] separately models attributes and items, and decides whether to ask or recommend
based on hand-crafted heuristics.
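As a sketch of this arm-choosing policy: score every arm in the unified pool with Eq. (6) and act according to the type of the top arm. The helper below is illustrative only; the id bookkeeping (`arm_ids`, `attribute_ids`) is our own assumption:

```python
import numpy as np

def choose_action(u_tilde, arm_ids, arm_embeddings, attribute_ids,
                  preferred_attr_embeddings, k=10):
    """Score every arm with Eq. (6) and decide whether to ask or recommend.

    arm_ids: identifiers of the current arm pool A_u (items and attributes)
    arm_embeddings: (n_arms, d) embeddings x_a, aligned with arm_ids
    attribute_ids: set of ids in arm_ids that are attributes
    preferred_attr_embeddings: (m, d) embeddings of attributes in P_u (m may be 0)
    """
    # r(a, u, P_u) = u~^T x_a + sum_{p_i in P_u} x_a^T p_i
    rewards = arm_embeddings @ u_tilde
    if preferred_attr_embeddings.shape[0] > 0:
        rewards = rewards + arm_embeddings @ preferred_attr_embeddings.sum(axis=0)

    best_arm = arm_ids[int(np.argmax(rewards))]
    if best_arm in attribute_ids:
        return "ask", [best_arm]  # ask the top attribute (or several, per setting)

    # Otherwise recommend the top-k items by reward, skipping attribute arms.
    order = np.argsort(-rewards)
    top_items = [arm_ids[i] for i in order if arm_ids[i] not in attribute_ids][:k]
    return "recommend", top_items
```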

4.4 Updating
After asking a question or recommending items, the agent receives feedback from the user,
and the feedback is used to update our model. First, ConTS renews the arm pool in the
current conversation. If the user gives negative feedback on either attribute(s) or items, we remove
them from the current arm pool A𝑢 . If he likes the attribute, it is added into the user's preferred
attribute pool P𝑢 and the agent filters out items without the preferred attribute from the candidate
pool. And if the user accepts the recommended item, the session ends successfully according to
the MCR scenario.
Different from the typical update method of contextual Thompson Sampling described in Eqs. (2)–(4)
of Section 3.2, ConTS is adapted to the conversational recommendation scenario for cold-start
users by considering two additional factors: u𝑖𝑛𝑖𝑡 for user embedding initialization and Σ_{𝑝𝑖∈P𝑢} p𝑖 for
reward estimation. In fact, Eq. (6) can be re-written as:

𝑟(𝑎, 𝑢, P𝑢) ∼ N(uᵀx𝑎(𝑡) + Σ_{𝑝𝑖∈P𝑢} x𝑎(𝑡)ᵀp𝑖 , 𝑙²) = N(x𝑎(𝑡)ᵀ(u𝑜𝑟𝑖𝑔 + u𝑖𝑛𝑖𝑡 + Σ_{𝑝𝑖∈P𝑢} p𝑖), 𝑙²),    (7)

where u𝑜𝑟𝑖𝑔 denotes the user embedding exactly in the original contextual Thompson Sampling [1],
in which u𝑜𝑟𝑖𝑔 is initialized with zero vector, and 𝑟 (𝑎, 𝑢, P𝑢 ) is our observed reward according to
user’s feedback. During the whole process, u𝑜𝑟𝑖𝑔 is the only variable to be updated while u𝑖𝑛𝑖𝑡 and
Í
𝑝𝑖 ∈ P𝑢 p𝑖 are given, which can be regarded as the bias used in predicting the real reward from the
original user embedding. According to Eq. (7), we have x𝑎 (𝑡 ) 𝑇 u𝑜𝑟𝑖𝑔 = E[𝑟 (𝑎, 𝑢, P𝑢 )] − x𝑎 (𝑡 ) 𝑇 (u𝑖𝑛𝑖𝑡 +
Í
𝑝𝑖 ∈ P𝑢 p𝑖 ). Intuitively speaking, the reward generated by the user embedding 𝑢𝑜𝑟𝑖𝑔 needs to be
de-biased from the observed reward 𝑟 𝑎 by removing the effect of the two bias terms. This intuition
can also be reflected in the derivation of the user embedding’s posterior Í distribution as shown
in Eq. (8). To simplify the notations, we define 𝑟 𝑎′ = 𝑟 𝑎 − x𝑎 (𝑡 ) 𝑇 (u𝑖𝑛𝑖𝑡 + 𝑝𝑖 ∈ P𝑢 p𝑖 ), Σ = 𝑙 2 B𝑢−1 , and

ACM Transactions on Information Systems, Vol. 39, No. 4, Article 40. Publication date: August 2021.
40:12 Li and Lei, et al.

simplify u𝑜𝑟𝑖𝑔 as u𝑜 .
𝑃(u𝑜 | (x, 𝑎, 𝑟𝑎)) ∝ 𝑃((x, 𝑎, 𝑟𝑎) | u𝑜) 𝑃(u𝑜)    (8)
  = N(u𝑜ᵀx𝑎 + u𝑖𝑛𝑖𝑡ᵀx𝑎 + Σ_{𝑝𝑖∈P𝑢} x𝑎ᵀp𝑖 , 𝑙²) · N(𝝁𝑢 , Σ)
  = det(Σ)⁻¹/² / √((2𝜋)^(𝑑+1) 𝜎²) · exp(−(𝑟𝑎 − u𝑜ᵀx𝑎 − u𝑖𝑛𝑖𝑡ᵀx𝑎 − Σ_{𝑝𝑖∈P𝑢} x𝑎ᵀp𝑖)² / 2𝑙²) · exp(−(u𝑜 − 𝝁𝑢)ᵀΣ⁻¹(u𝑜 − 𝝁𝑢) / 2)
  = det(Σ)⁻¹/² / √((2𝜋)^(𝑑+1) 𝜎²) · exp(−(𝑟′𝑎 − u𝑜ᵀx𝑎)² / 2𝑙²) · exp(−(u𝑜 − 𝝁𝑢)ᵀΣ⁻¹(u𝑜 − 𝝁𝑢) / 2)
  = det(Σ)⁻¹/² / √((2𝜋)^(𝑑+1) 𝜎²) · exp(−[ u𝑜ᵀ(Σ⁻¹ + (1/𝑙²) x𝑎x𝑎ᵀ) u𝑜 − 2 u𝑜ᵀ(Σ⁻¹𝝁𝑢 + (1/𝑙²) x𝑎𝑟′𝑎) + 𝝁𝑢ᵀΣ⁻¹𝝁𝑢 + 𝑟′𝑎²/𝑙² ] / 2)

Since the Gaussian distribution has a conjugate prior, the posterior 𝑃(u𝑜 | (x, 𝑎, 𝑟𝑎)) still follows a
Gaussian distribution, i.e., it can be written as 𝑃(u𝑜 | (x, 𝑎, 𝑟𝑎)) = N(u𝑜,𝑝𝑜𝑠𝑡 , 𝑙²Σ𝑜,𝑝𝑜𝑠𝑡), in which u𝑜,𝑝𝑜𝑠𝑡
and 𝑙²Σ𝑜,𝑝𝑜𝑠𝑡 are the posterior mean and covariance matrix, respectively. According to Eq. (8), it is
easy to obtain:

Σ𝑜,𝑝𝑜𝑠𝑡⁻¹ = Σ⁻¹ + (1/𝑙²) x𝑎x𝑎ᵀ = (1/𝑙²)(B𝑢 + x𝑎x𝑎ᵀ)    (9)

u𝑜,𝑝𝑜𝑠𝑡 = Σ𝑜,𝑝𝑜𝑠𝑡 (Σ⁻¹𝝁𝑢 + (1/𝑙²) x𝑎𝑟′𝑎) = (1/𝑙²) Σ𝑜,𝑝𝑜𝑠𝑡 (B𝑢𝝁𝑢 + x𝑎𝑟′𝑎) = (1/𝑙²) Σ𝑜,𝑝𝑜𝑠𝑡 (f𝑢 + x𝑎𝑟′𝑎)    (10)
According to the derived posterior mean and covariance matrix in Eq. (9) and Eq. (10), ConTS
updates the parameters of the distribution of the user's embedding as follows:

B𝑢 = B𝑢 + x𝑎(𝑡) x𝑎(𝑡)ᵀ    (11)
𝑟′𝑎 = 𝑟𝑎 − x𝑎(𝑡)ᵀ(u𝑖𝑛𝑖𝑡 + Σ_{𝑝𝑖∈P𝑢} p𝑖)    (12)
f𝑢 = f𝑢 + 𝑟′𝑎 · x𝑎(𝑡)    (13)
𝝁𝑢 = B𝑢⁻¹ f𝑢 .    (14)
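A sketch of this update step in NumPy, assuming `preferred_attr_embeddings` stacks the embeddings of the attributes in P𝑢 (possibly an empty (0, d) array); the function name is ours:

```python
import numpy as np

def conts_update(B_u, f_u, x_a, r_a, u_init, preferred_attr_embeddings):
    """ConTS posterior update with the de-biased reward, Eqs. (11)-(14)."""
    B_u = B_u + np.outer(x_a, x_a)                         # Eq. (11)
    # De-bias the observed reward by the two fixed terms u_init and sum of p_i.
    bias = u_init + preferred_attr_embeddings.sum(axis=0)  # zero vector if P_u is empty
    r_prime = r_a - x_a @ bias                             # Eq. (12)
    f_u = f_u + r_prime * x_a                              # Eq. (13)
    mu_u = np.linalg.solve(B_u, f_u)                       # Eq. (14): mu_u = B_u^{-1} f_u
    return B_u, f_u, mu_u
```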

4.5 Complexity Analysis


Lastly, we briefly analyse the complexity of ConTS. If 𝑛 = |V𝑡 | denotes the size of the current
candidate item pool, 𝑘 = |P𝑡 | denotes the size of the current candidate attribute pool, and 𝑑
denotes the dimension of the arm embeddings, the complexity of ConTS is 𝑂(2𝑑(𝑛 + 𝑘) + 𝑑²). The first
part, 𝑂(2𝑑(𝑛 + 𝑘)), is the complexity of ranking all items and attributes (usually 𝑛 ≫ 𝑘), which
is common in most traditional static recommendation models such as matrix factorization [25]
and factorization machines [38]. The second part, 𝑂(𝑑²), is the complexity of computing the inverse
of the matrix of the user embedding, which is a typical complexity component of contextual Thompson
Sampling algorithms [1, 29]. Note that the per-trial computational complexity of the matrix inverse can
be reduced to 𝑂(𝑑²) according to [29]. Since ConTS is based on contextual Thompson Sampling, the
detailed complexity analysis of ConTS is the same as that of contextual Thompson Sampling.
Therefore, we directly give the result, as readers can easily find the detailed analysis in the classic
literature on contextual Thompson Sampling [1, 29]. To sum up, ConTS does not involve
any higher-order complexities compared to standard static recommendation methods or
contextual Thompson Sampling methods, making it practicable in real applications.


5 EXPERIMENTS
In this section, we conduct experiments to compare ConTS with representative conversational
recommendation systems and with three variants of ConTS, each dropping one key component
of the original design. We not only examine the overall performance but also analyze the key
mechanisms designed in ConTS, e.g., keeping the EE balance, seamlessly unifying items and attributes,
etc. Moreover, we design experiments to discuss ConTS's advantages over ConUCB and compare
the performance of Thompson Sampling and UCB in our setting. Furthermore, we explore the
performance of all methods with smaller maximum conversation turns. Lastly, we conduct experiments to
study the impact of different cold-start degrees and present a case study for comparison.

5.1 Datasets
We perform experiments on three real-world datasets which vary significantly in domains and
settings. The settings include enumerated questions on Yelp, binary questions on LastFM and
multi-attribute questions on Kuaishou (details are given below). We compare ConTS
with other methods on these three settings, aiming to evaluate the model's generalization across
various conversational recommendation scenarios.
• Yelp9 : This is a dataset for business recommendation. We use the version processed by [27],
which contains a two-level taxonomy of the attributes in the original Yelp data10 . For example,
the original attributes "wine", "beer" and "whiskey" all belong to the parent attribute "alcohol".
There are 29 such parent attributes over the 590 original attributes. The agent chooses
from these 29 parent attributes to ask during a conversation. Meanwhile, all the child attributes
are displayed to the user, who chooses the preferred ones. Similar to EAR [27], this
setting is denoted as enumerated questions. This design avoids overly lengthy conversations
caused by a large attribute space.
• LastFM9 : This dataset is for music artist recommendation. We also exactly follow the pre-
processing by [27]. Different from the setting on the Yelp dataset, this dataset is designed for
binary questions. This means that the system chooses one of the 33 attributes and asks
the user to give binary feedback (like or dislike) on the asked attribute.
• Kuaishou11 : We construct a new dataset built on the video-click records of cold-start users
in the Kuaishou app. As shown in Figure 1(b), the app shows an interface to the user and asks
for his preference on multiple attributes instead of a single one. Following this scenario, each
time the agent decides to ask for preferences on attributes, our ConTS chooses the top
12 attributes according to the estimated rewards. We still ask the user to give binary feedback
on each attribute. It is also worth mentioning that we follow [27] to filter out low-frequency
data, including users with fewer than 10 view records and attributes with fewer than 5
occurrences. This ensures that we can simulate more than 10 conversations for each user,
making our evaluation more robust.
5.1.1 Dataset Partition. The statistics of the three datasets are summarized in Table 2. We partition
each dataset into a training set, a testing set and a validation set. As discussed in Section 4.2,
the training set is used to train the embeddings of existing users, items and attributes offline. The
testing set is utilized for the conversation simulation of cold-start users, which is detailed in
Section 5.2.2. The validation set is used for parameter tuning.
To simulate the cold-start setting, the training, testing and validation sets are user-disjoint.
This means that, for any user in one set, we cannot find any corresponding user-item interaction
9 https://drive.google.com/open?id=13WcWe9JthbiSjcGeCWSTB6Ir7O6riiYy
10 https://www.yelp.com/dataset/
11 https://github.com/xiwenchao/Kuaishou-data/


Algorithm 2 Partition of Datasets

Input: collection of all users 𝑈 , collection of all user-item interaction records 𝐼 , collection of interaction records 𝐼𝑢 for each user 𝑢 ∈ 𝑈 ;
Output: training set 𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 , testing set 𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 , validation set 𝐼𝑣𝑎𝑙𝑖𝑑𝑎𝑡𝑖𝑜𝑛 ;
1: Initialization: the training, testing and validation sets 𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 = ∅, 𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 = ∅, 𝐼𝑣𝑎𝑙𝑖𝑑𝑎𝑡𝑖𝑜𝑛 = ∅;
2: loop Randomly sample a user 𝑢 from 𝑈 :
3:    if |𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 | ≤ 70% · |𝐼 | (|·| denotes the size of a set) then
4:       𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 = 𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ∪ 𝐼𝑢
5:       continue
6:    if |𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 | ≤ 15% · |𝐼 | then
7:       𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 = 𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 ∪ 𝐼𝑢
8:       continue
9:    𝐼𝑣𝑎𝑙𝑖𝑑𝑎𝑡𝑖𝑜𝑛 = 𝐼 \ (𝐼𝑡𝑒𝑠𝑡𝑖𝑛𝑔 ∪ 𝐼𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 )
10:   End sampling, quit

records in the other two sets. As such, we treat the users in the training set as existing users, while
the users in the validation and testing sets are treated as new users, as their interaction records do not appear in the
training set. Note that the partition is based on the proportion of interaction records. Specifically,
we use Algorithm 2 to generate a partition such that the user-item interaction records in the training
set take up 70% of all records in the whole dataset, while the testing and validation sets each
take up 15%. We choose this partition strategy to make our offline training of the FM
model on par with [27], where the FM model is trained using 70% of the user-item interaction
records. The statistics of the training, testing and validation sets of the three datasets are
given in Table 3.
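A direct Python transcription of Algorithm 2, under the assumption that records are pre-grouped per user; the names are illustrative:

```python
import random

def partition_by_user(records_per_user, seed=0):
    """User-disjoint 70/15/15 split measured in interaction records (Algorithm 2)."""
    total = sum(len(recs) for recs in records_per_user.values())
    users = list(records_per_user)
    random.Random(seed).shuffle(users)  # random sampling order over users

    train, test, valid = [], [], []
    for u in users:
        if len(train) <= 0.70 * total:   # fill the training set first
            train.extend(records_per_user[u])
        elif len(test) <= 0.15 * total:  # then the testing set
            test.extend(records_per_user[u])
        else:                            # the remainder goes to validation
            valid.extend(records_per_user[u])
    return train, test, valid
```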

Table 2. Statistics of the datasets after preprocessing.

Dataset   | Users  | Items  | Interaction records | Attributes
Yelp      | 27,675 | 70,311 | 1,345,606           | 590
LastFM    | 1,801  | 7,432  | 76,693              | 33
Kuaishou  | 6,861  | 18,612 | 100,699             | 665

Table 3. Statistics of the training, testing and validation sets.

Dataset   | Users (Training / Testing / Validation) | Interaction records (Training / Testing / Validation)
Yelp      | 21,309 / 3,205 / 3,161                  | 941,928 / 201,845 / 201,833
LastFM    | 1,260 / 279 / 262                       | 53,636 / 11,532 / 11,525
Kuaishou  | 4,537 / 1,175 / 1,149                   | 70,492 / 15,110 / 14,097

5.2 Experimental Setting


5.2.1 Baselines. While many good conversational recommendation systems such as [2, 8, 10, 35, 36,
42, 44, 52, 53] have been proposed recently, most of them cannot tackle the cold-start scenario in
MCR and are hence not comparable with our ConTS. We select the following representative
models as our baselines:


• Abs Greedy [11]: This method trains an offline FM model based on existing users' records. It
then initializes the embedding for a new user with the average of existing users' embeddings.
Each time, it recommends the top 𝑘 items ranked by the FM and updates the FM according
to the rejected item list. However, it does not consider any attribute information and recommends
without asking any questions. Christakopoulou et al. [11] have shown that Abs Greedy performs
better than classic Thompson Sampling, so we do not take classic Thompson Sampling as a
baseline here.
• EAR [27]: EAR is a state-of-the-art model for the MCR scenario with warm-start users. It trains a
policy network by reinforcement learning to decide whether to ask questions or recommend
items. The input of the network is a vector encoding the entropy information of each
attribute, the user's preference on each attribute, the conversation history, and the length of the
current candidate list. The output is a probability distribution over all actions, which includes
all attributes and a dedicated action for recommendation. It uses a pretrained offline FM
model to decide which items to recommend and employs a policy network trained by policy
gradient to decide the conversational policy. To adapt EAR to our setting, we use the average of
existing users' embeddings as the embedding of a new user and use the existing-user set for
offline training. The new-user set is used for testing. The rest of the configuration is exactly
the same as the original EAR implementation.
• ConUCB [54]: Conversational UCB is a recently proposed method applying bandits to the con-
versational recommendation scenario. It models attributes and items as different arms
and chooses them separately by different ranking scores. As for the attribute choice policy,
the authors put forward four methods to choose the attribute to ask: Random,
Maximal Confidence, Maximal Confidence Reduction and Advanced Strategy. If 𝑛 = |V𝑡 |
denotes the size of the current candidate item pool, 𝑘 = |P𝑡 | denotes the size of the current
candidate attribute pool, and 𝑑 denotes the dimension of the item embeddings, Maximal Confidence
Reduction is an 𝑂(𝑛² + 2𝑑²) algorithm and Advanced Strategy is an 𝑂(2𝑛𝑑𝑘 + 2𝑑²)
algorithm, while our ConTS is 𝑂(2𝑑(𝑛 + 𝑘) + 𝑑²) (note that 𝑛 ≫ (𝑑 + 𝑘) in our setting).
We choose the second method, "Maximal Confidence", proposed in the work. The third and
fourth methods in the paper are infeasible in our setting, because the large arm pool in our
corpus would result in gigantic computational complexity. All parameters are tuned
on the validation set following the methods in the original paper. The optimal performance is
achieved when 𝜆 = 0.8 and 𝜆̃ = 1. We also explore different policy functions in ConUCB and
report the best result, which is discussed in detail in Section 5.4.
To validate the key components of our ConTS design, we also compare ConTS with the following
variants, each with one component ablated:
• ConTS-u𝑖𝑛𝑖𝑡 : As discussed in Section 4.2, ConTS initializes the mean of a new user's preference
distribution with the average of existing user embeddings, denoted as u𝑖𝑛𝑖𝑡 . We
probe its impact on ConTS by removing it from the model, where we follow Agrawal and
Goyal [1] to initialize the mean with a zero vector.
• ConTS-P𝑢 : ConTS models the user's currently known preferred attributes through the term
Σ_{𝑝𝑖∈P𝑢} p𝑖 when calculating the expected reward for each arm. This is to capture the mutual
promotion of items and attributes, since they are in the same space. We design experiments to
discuss its effect on ConTS by removing P𝑢 from the reward estimation function (Eq. (6) in
Section 4.3). As such, the reward estimation exactly follows contextual Thompson Sampling [1].
The resultant system is denoted as ConTS-P𝑢 .
• ConTS-exp: ConTS inherits the sampling mechanism from contextual Thompson Sampling
to explore the user's latent unknown preferences. We conduct experiments to investigate
the effectiveness of such exploration in ConTS. Specifically, we remove the sampling step
and just take the mean of the distribution as the user embedding in each turn (we also keep
the parameter B𝑢 as the identity matrix), aiming to learn the performance of ConTS without
exploration.
5.2.2 User Simulator. As conversational recommendation is an interactive process, a CRS needs to
be trained and evaluated by interacting with users. However, it is expensive and less reproducible
to directly recruit human beings to interact with the CRSs. Therefore, we follow the approach of Lei
et al. [27] and Sun and Zhang [44] to build a user simulator that simulates a user's feedback based on the
user's historical user-item interaction records. Specifically, given a user-item record (𝑢, 𝑣), we
treat 𝑣 as the user's preferred item and its corresponding attributes P𝑣 as his preferred attributes in
this session. We simulate a conversation session for each interaction record, and the user gives
positive feedback only on item 𝑣 and attributes P𝑣 in the current session. We also follow [27] to set
the maximum turn 𝑇 to 15 in our primary experiments, but we will also discuss the model performances
under other settings in Section 5.6. If the user accepts the items recommended by the
CRS, the conversation ends successfully. If the recommendation is still not successful upon
reaching the maximum number of turns, the simulated user terminates the session due to a lack of
patience. Note that we ask enumerated questions on Yelp: if the user rejects a parent attribute,
ConTS naturally updates the algorithm with all its child attributes together. Lastly, it is worth
mentioning that we initialize the distribution of the embedding for a new user in the first conversation
with him. After the first conversation finishes, his parameters are kept and continue to be updated
in the following conversations.
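A sketch of the simulator's per-turn feedback rule, assuming the target record (𝑢, 𝑣) defines the only liked item and its attributes P𝑣 the only liked attributes; the interface is our own:

```python
def simulate_feedback(action, payload, target_item, target_attrs):
    """Simulated user feedback for one turn of an MCR session.

    action: "ask" with payload = attribute(s) asked, or
            "recommend" with payload = the top-k recommended items.
    """
    if action == "ask":
        # The user likes exactly the attributes of the target item v.
        return [p in target_attrs for p in payload]
    # The session succeeds only if the target item is among the k recommendations.
    return target_item in payload
```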
5.2.3 Evaluation Metrics. As in [11], we use metrics of conversational recommendation for
evaluation. Following Christakopoulou et al. [11] and Lei et al. [27], we use two
metrics to measure the performance of each model. First, we use the success rate (SR@t) [44] to
measure the ratio of successful conversations, i.e., those with a successful recommendation by turn 𝑡 12 .
It can be formulated as SR@t = (#successful conversations by turn 𝑡) / (#conversations). We also report
the average turns (AT) needed to end a session. Note that if a conversation session has not ended at
turn 𝑇 , we simply set its number of turns to 𝑇 . A higher SR@t represents more successful conversations
and a smaller AT means more efficient conversations. We use a one-sample paired t-test to judge
statistical significance.
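Computed over the simulated sessions, the two metrics reduce to the small sketch below (assuming each session records the turn at which it succeeded, or None if it failed within 𝑇 turns; the function name is ours):

```python
def evaluate(session_turns, t, T=15):
    """Compute SR@t and AT over a list of per-session success turns."""
    n = len(session_turns)
    # SR@t: fraction of sessions that succeeded by turn t.
    sr_t = sum(1 for s in session_turns if s is not None and s <= t) / n
    # AT: average turns; a session that never succeeds counts as T turns.
    at = sum(s if s is not None else T for s in session_turns) / n
    return sr_t, at
```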
5.2.4 Implementation Details. The working process includes an offline and an online stage. In the offline
stage, we use the records of all existing users (the training set) to train the user, item and attribute
embeddings with the FM proposed in [27]; the training objective is to rank the user's preferred
items and attributes higher. It is optimized using SGD with a regularization strength of 0.001.
In the online stage, we run ConTS on the testing set. Specifically, we let ConTS interact
with the user simulator described in Section 5.2.2; the dialogue sessions are simulated from the
user-item interaction records in the testing set. The parameters of ConTS are tuned
on simulated conversations from the validation set. The values of the user's feedback in ConTS are
searched in the range of [-8, 8] by grid search. We find that, for better performance, the absolute
value of positive feedback should be larger while the absolute value of negative feedback should be
smaller. Specifically, the optimal parameters are set as follows: 𝑟_fail_rec = −0.15, 𝑟_fail_ask = −0.03,
𝑟_suc_ask = 5, 𝑟_suc_rec = 5; besides, the hyperparameters are set as 𝑙 = 0.01 and 𝑘 = 10. Following
EAR [27], the embedding dimension 𝑑 is set to 64 for items, attributes and users.
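For concreteness, a hedged sketch of such a grid search follows. The candidate value grids and the validation evaluator are illustrative assumptions, not the exact search we ran:

```python
import itertools
import random

def evaluate_on_validation(config):
    # placeholder: in the real pipeline this would run ConTS on simulated
    # validation conversations and return SR@15 for this reward config
    random.seed(hash(tuple(sorted(config.items()))) % 2**32)
    return random.random()

# assumed candidate grids inside the stated [-8, 8] search range
search_space = {
    "r_fail_rec": [-0.3, -0.15, -0.05],
    "r_fail_ask": [-0.1, -0.03, -0.01],
    "r_suc_ask":  [1.0, 5.0, 8.0],
    "r_suc_rec":  [1.0, 5.0, 8.0],
}

best_config, best_sr = None, -1.0
for combo in itertools.product(*search_space.values()):
    config = dict(zip(search_space, combo))
    sr = evaluate_on_validation(config)
    if sr > best_sr:
        best_config, best_sr = config, sr
print(best_config, best_sr)
```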


5.3 Experiment Results

Table 4. SR@15 and AT of the compared methods on Yelp, LastFM and Kuaishou. Bold scores denote the
best in each column, while underlined scores denote the best baseline. Statistical significance testing is
performed (𝑝 < 0.01).

                 Yelp             LastFM           Kuaishou
                 SR@15    AT      SR@15    AT      SR@15    AT
Abs Greedy 0.340 14.0115 0.0104 14.9203 0.1395 14.0573
ConUCB 0.262 13.0117 0.1540 14.2279 0.1457 13.7136
EAR 0.446 11.5825 0.0640 14.3278 0.2362 13.4753
ConTS-u𝑖𝑛𝑖𝑡 0.734 8.6195 0.1514 13.9767 0.4644 12.0534
ConTS-P𝑢 0.328 12.8060 0.1505 14.0210 0.4031 12.2949
ConTS-exp 0.491 10.9975 0.0610 14.5912 0.1446 13.7879
ConTS 0.756* 8.5808* 0.1628* 13.9162* 0.4872* 11.8703*

Figure 3 and Table 4 report the overall experimental results. Specifically, Figure 3 compares the
recommendation Success Rate (SR@t) at each turn (𝑡 = 1 to 15) of the different models. For a better view,
we set ConTS-u𝑖𝑛𝑖𝑡 (i.e., ConTS without u𝑖𝑛𝑖𝑡) as the 𝑦 = 0 line on Yelp, ConUCB as 𝑦 = 0 on LastFM and EAR as 𝑦 = 0
on Kuaishou. From Figure 3, we can see that ConTS achieves a higher
success rate than the other models in nearly all turns, with the exception of the first several turns
on the Yelp dataset. We find this is because the policy is somewhat random at the very beginning of each
conversation session, hence the performances there are not very indicative; this is especially severe on
datasets with a larger item pool like Yelp. Table 4 reports the performance of all models on
average turn (AT) and final success rate SR@15 (the max turn is set to 15 here). It is clear
that ConTS significantly outperforms all the other models on both SR@15 and AT. We notice
that nearly all the models perform much worse on LastFM and Kuaishou than on Yelp.
The main reason is that ConTS asks enumerated questions on Yelp (c.f. Section 5.1), where the user
chooses among child-level attributes at a finer granularity (an enumerated question covers more than 20
original attributes on average in a turn). This makes the candidate items shrink much faster than
on LastFM and Kuaishou, which adopt the binary question setting. This phenomenon has also
been discussed by Lei et al. [27] in detail.
The experiments also demonstrate the effectiveness of asking for attribute preference. We can
see that Abs Greedy, the only model unable to ask attributes, has nearly the worst success rate at
all turns. We argue that asking attributes significantly helps in this scenario on two points: (1)
eliminating unqualified candidate items, i.e., the items that do not contain all the user's preferred
attributes; and (2) helping better estimate user preference, i.e., each arm's reward. While
the first point is obvious, the second can be validated by comparing ConTS and ConTS-P𝑢.
The only difference between the two is that ConTS models the user's known preferred
attributes through the term $\sum_{p_i \in \mathcal{P}_u} \mathbf{x}_a^T \mathbf{p}_i$ in the calculation of arm rewards.
From Figure 3 and Table 4, we can clearly see that
ConTS-P𝑢 significantly underperforms ConTS. The results suggest that calculating rewards with
$\sum_{p_i \in \mathcal{P}_u} \mathbf{x}_a^T \mathbf{p}_i$ contributes significantly to estimating the user's preference in ConTS.
By comparing EAR with ConTS, we can see the importance of exploration for cold-start users.
EAR is the state-of-the-art model for warm-start users in the MCR scenario [27]. It uses the same
function as ConTS to estimate the user preference (Eq. (6) in Section 4.3). The major difference
is that ConTS customizes contextual Thompson Sampling to balance exploration and exploitation,
while EAR uses reinforcement learning to only exploit the user preference learned in the
training stage. The lack of an exploration mechanism leads to inferior performance compared
with our ConTS. The importance of exploration can further be demonstrated by comparing ConTS
with ConTS-exp, the variant of ConTS with the exploration mechanism of contextual Thompson
Sampling removed. From Figure 3 and Table 4, we can observe that ConTS-exp performs
worse than ConTS.
We have also studied the impact of initialization. ConTS initializes the distribution of a new
user's embedding by setting the initial mean as the average of the embeddings of existing users
(i.e., u𝑖𝑛𝑖𝑡 in Eq. (5)). We argue this step is also useful, and the superior performance of ConTS over
ConTS-u𝑖𝑛𝑖𝑡 validates it (c.f. Figure 3 and Table 4). Interestingly, we also notice that on Yelp the success rate
curve of ConTS-u𝑖𝑛𝑖𝑡 in Figure 3 gradually approaches the curve of ConTS as the number
of conversation turns increases. This is due to the online updates: as the conversation evolves, the CRS
obtains more information about the user's preference from his feedback, so the model benefits
less from the initialization. This trend is not obvious on LastFM and Kuaishou; we believe this
is due to the intrinsically different question settings. The binary questions adopted on LastFM
and Kuaishou help less than the enumerated questions in shrinking the candidate pool, leaving a
much larger candidate pool. As a result, the updates are less efficient.

Yelp
0.4

0.2
Success Rate*

ConTS

-0.2 EAR

ConUCB
-0.4
Abs Greedy

ConTS-Uinit
-0.6
ConTS-Pu

ConTS-exp
-0.8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Turn Of Conversation

(a)

LastFM
0.1

0.05

0
Success Rate*

ConTS EAR
-0.05

ConUCB Abs Greedy


-0.1

CocTS-Uinit ConTS-Pu
-0.15

ConTS-exp
-0.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Turn Of Conversation

(b)

ACM Transactions on Information Systems, Vol. 39, No. 4, Article 40. Publication date: August 2021.
Seamlessly Unifying Attributes and Items: Conversational Recommendation for Cold-Start Users 40:19

Kuaishou
0.3
ConTS EAR
0.25
ConUCB Abs Greedy
0.2 ConTS-Uinit ConTS-Pu
Success Rate*
0.15 ConTS-exp

0.1

0.05

-0.05

-0.1

-0.15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Turn Of Conversation

(c)

Fig. 3. Success Rate* (SR@t) of the compared methods at different conversation turns on Yelp (a), LastFM (b)
and Kuaishou (c). Success Rate* denotes the difference of SR between each method and the system represented
by the line 𝑦 = 0. We set ConTS-u𝑖𝑛𝑖𝑡 as 𝑦 = 0 on Yelp, ConUCB as 𝑦 = 0 on LastFM and EAR as 𝑦 = 0 on
Kuaishou to facilitate visualization.

5.4 Discussion: Unifying Attributes and Items


The key contribution of our ConTS is the holistic modeling: we seamlessly unify attributes and
items in the same arm space, naturally fitting them into the framework of contextual Thompson
Sampling. The model only needs to calculate the reward for each arm using a unified function (Eq.
(6)). This reduces the conversation policy questions (e.g., deciding what attributes to ask, what items
to recommend, and whether to ask attributes or make recommendations) to a single problem
of arm choosing. It also helps to capture the mutual promotion between items and attributes
by considering both in Eq. (6). In this section, we investigate these issues by examining ConUCB
in depth, which also takes items and attributes as arms in the bandit framework
but models them separately. We report experimental results on the Yelp dataset, upon which the
original ConUCB [54] is built, as they are the most representative; the conclusions in this section also
apply to the other two datasets.
Specifically, in the original paper [54] of ConUCB, the model decides when to ask attributes
based on the heuristic that the system only asks attributes if 𝑏(𝑡) − 𝑏(𝑡 − 1) > 0, where 𝑏(𝑡) is
a function of the turn number in the conversation. However, different functions may work better on
different datasets or application scenarios, which makes the model less robust, as it is hard to choose
a universal rule for all cases. For example, the authors tried several different functions to control the
policy, such as $5\log\lfloor t \rfloor$, $10\log\lfloor t \rfloor$, $15\log\lfloor t \rfloor$, $10\log\frac{t}{50}$, etc. The paper compares the model's
performance under these different functions, but it does not give a compelling strategy for choosing one
according to the situation. In the original paper [54], the authors discuss
that the model may perform better on the Yelp dataset if the chosen function encourages


asking more questions. However, to ensure successful recommendation within limited
conversation turns, the system should always find the balanced point between asking questions and
recommending items. Even if, following ConUCB, we luckily find a function with
good performance on the current dataset among countless candidates, we can never be
sure it will work well the next time we face a different recommendation situation. To investigate
the robustness of this choice, we compare the performance of ConUCB with different
functions on Yelp. Table 5 reports the results: the performance fluctuates hugely
across different functions. Different from ConUCB, our ConTS is totally data-driven; it
learns the conversation policy purely from the model estimated on user feedback,
making it more intelligent and portable to new application scenarios.
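To make the heuristic concrete, the following sketch shows ConUCB's ask-or-recommend rule with a few of the candidate 𝑏(𝑡) functions mentioned above (the guard against log(0) is our addition):

```python
import math

def make_policy(b):
    """Ask an attribute in turn t only when b(t) - b(t-1) > 0."""
    def should_ask(t):
        return b(t) - b(t - 1) > 0
    return should_ask

candidates = {
    "5*log(floor(t))": lambda t: 5 * math.log(max(math.floor(t), 1)),
    "10*log(t)":       lambda t: 10 * math.log(max(t, 1)),
    "15*log(t)":       lambda t: 15 * math.log(max(t, 1)),
}

should_ask = make_policy(candidates["10*log(t)"])
print([should_ask(t) for t in range(1, 6)])  # which early turns would ask attributes
```

Swapping one candidate for another changes the whole ask/recommend schedule, which is exactly the robustness issue discussed above.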

Table 5. SR@15 and AT of ConUCB w.r.t. different functions 𝑏(𝑡) used to decide whether to ask attributes on
Yelp. Statistical significance testing is performed (𝑝 < 0.01).

Policy Function 𝑏(𝑡)    SR@15    AT
5 ∗ ⌊log(𝑡)⌋ 0.117 13.8065
5 ∗ log(𝑡) 0.119 13.7725
10 ∗ log(𝑡) 0.262 13.0117
15 ∗ log(𝑡) 0.162 12.8927

We also investigate the mutual promotion of items and attributes. In ConUCB, the model
uses two totally separate methods to choose items and attributes. We probe the mutual
promotion by replacing the attribute choosing function (i.e., the Maximal Confidence) used
in ConUCB with our reward estimation function (Eq. (6)), which is a modified FM. (It is not
portable to experiment with the item choosing function, as adapting it requires complex
mathematical derivation; hence we only experiment with the attribute choosing function in ConUCB.) Namely:
$$\mathbb{E}[r(a, u, \mathcal{P}_u)] = \mathbf{u}^T \mathbf{x}_a + \sum_{p_i \in \mathcal{P}_u} \mathbf{x}_a^T \mathbf{p}_i. \qquad (15)$$
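For illustration, a minimal NumPy sketch of this unified score follows; the random embeddings are placeholders, and only the formula mirrors Eq. (15):

```python
import numpy as np

def expected_reward(u, x_a, preferred_attrs):
    """Eq. (15): u^T x_a plus sum of x_a^T p_i over the user's known attributes."""
    attr_term = sum(x_a @ p_i for p_i in preferred_attrs)  # 0 if P_u is empty
    return u @ x_a + attr_term

d = 64  # embedding dimension, as in the paper
rng = np.random.default_rng(0)
u, x_a = rng.normal(size=d), rng.normal(size=d)
P_u = [rng.normal(size=d) for _ in range(2)]  # two known preferred attributes
print(expected_reward(u, x_a, P_u))
```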

Table 6. Comparing Maximal Confidence and the modified FM as the attribute choosing policy in ConUCB on
Yelp. Statistical significance testing is performed (𝑝 < 0.01).

Policy 𝑏(𝑡)      Maximal Confidence      Modified FM
                 SR@15      AT           SR@15      AT
5 ∗ ⌊log(𝑡)⌋ 0.117 13.8060 0.125 13.7007
5 ∗ log(𝑡) 0.119 13.7725 0.532 10.3027
15 ∗ log(𝑡) 0.197 13.0117 0.598 9.7536

Table 6 shows that the modified FM (Eq. (15)) improves the performance of ConUCB by a large margin.
This demonstrates the effectiveness of the mutual promotion between attributes and items, further
validating our idea of unifying the modeling of attributes and items in the conversational
recommendation scenario.


5.5 Discussion: Thompson Sampling vs. UCB


Besides separately vs. jointly modeling items and attributes, ConTS and ConUCB also differ in
the bandit framework they adopt: ConTS utilizes Thompson Sampling while ConUCB utilizes UCB.
To isolate the effect of our strategy of unifying attribute and item modeling, we design another
experiment that removes the effect of the different bandit algorithms. To do so, we adapt
LinUCB [29] to our setting by replacing the sampling step with an upper confidence bound
(all other modules of ConTS remain the same). The reward for each arm in the adapted LinUCB
(denoted as Seamless-UCB) is calculated as follows:
$$\mathbb{E}[r(a, u, \mathcal{P}_u)] = \mathbf{u}^T \mathbf{x}_a + \sum_{p_i \in \mathcal{P}_u} \mathbf{x}_a^T \mathbf{p}_i + \alpha \sqrt{\mathbf{x}_a^T \mathbf{A}_u^{-1} \mathbf{x}_a}, \qquad (16)$$
where $\alpha \sqrt{\mathbf{x}_a^T \mathbf{A}_u^{-1} \mathbf{x}_a}$ denotes the upper confidence bound, and $\mathbf{u}$ and $\mathbf{A}_u$ are updated as:
$$\mathbf{A}_u = \mathbf{A}_u + \mathbf{x}_{a(t)} \mathbf{x}_{a(t)}^T \qquad (17)$$
$$r'_a = r_a - \mathbf{x}_{a(t)}^T \Big(\mathbf{u}_{init} + \sum_{p_i \in \mathcal{P}_u} \mathbf{p}_i\Big) \qquad (18)$$
$$\mathbf{b}_u = \mathbf{b}_u + r'_a \cdot \mathbf{x}_{a(t)} \qquad (19)$$
$$\mathbf{u} = \mathbf{A}_u^{-1} \mathbf{b}_u. \qquad (20)$$
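Under our reading of these updates, a minimal sketch looks as follows; class and variable names are ours, and the exploration weight 𝛼 is an assumed free parameter:

```python
import numpy as np

class SeamlessUCB:
    """Sketch of Seamless-UCB (Eqs. (16)-(20)); items and attributes share one arm space."""

    def __init__(self, u_init, d=64, alpha=1.0):
        self.A = np.eye(d)        # A_u, initialized as the identity matrix
        self.b = np.zeros(d)      # b_u
        self.u_init = u_init      # average embedding of existing users
        self.u = u_init.copy()    # current estimate of the user embedding
        self.alpha = alpha        # exploration weight (assumed tunable)

    def score(self, x_a, P_u):
        # Eq. (16): exploitation + attribute term + upper confidence bound
        attr_term = sum(x_a @ p for p in P_u)
        bonus = self.alpha * np.sqrt(x_a @ np.linalg.solve(self.A, x_a))
        return self.u @ x_a + attr_term + bonus

    def update(self, x_a, r, P_u):
        self.A += np.outer(x_a, x_a)                           # Eq. (17)
        r_dev = r - x_a @ (self.u_init + sum(p for p in P_u))  # Eq. (18)
        self.b += r_dev * x_a                                  # Eq. (19)
        self.u = np.linalg.solve(self.A, self.b)               # Eq. (20)

# toy usage: score two candidate arms for a fresh user with no known attributes
rng = np.random.default_rng(0)
d = 8
bandit = SeamlessUCB(u_init=rng.normal(size=d), d=d)
arms = [rng.normal(size=d) for _ in range(2)]
print([bandit.score(x, P_u=[]) for x in arms])
```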
From Table 7 we can see that Seamless-UCB, which applies our unified item-and-attribute
strategy within the UCB framework, still outperforms ConUCB. This again demonstrates the advantage
of our seamless modeling strategy.
Interestingly, from the comparison of Seamless-UCB and ConTS, we can see that Thompson
Sampling performs better in our setting. This is in line with previous research [34, 40],
which reports that Thompson Sampling performs better in many real-world scenarios and also reaches
a smaller theoretical regret bound. This discovery indicates that Thompson Sampling
might be more powerful than UCB in the conversational recommendation scenario. However, we leave
further discussion to future work, as the comparison between Thompson Sampling and UCB is
still an open question [41].

Table 7. Comparison between ConTS, ConUCB and Seamless-UCB. Bold scores are the best in each column.
Statistical significance testing is performed (𝑝 < 0.01).

                 Yelp             LastFM           Kuaishou
                 SR@15    AT      SR@15    AT      SR@15    AT
ConUCB 0.2620 13.0117 0.1540 14.2279 0.1457 13.7136
Seamless-UCB 0.7394 8.7785 0.1593 14.0126 0.4136 12.7126
ConTS 0.7558* 8.5808* 0.1628* 13.9162* 0.4872* 12.1033*

5.6 Discussion: Effect of Max Conversation Turn


As mentioned in Section 5.2.3, we set the max conversation turn to 15 in the above experiments.
However, some real-world scenarios may require a smaller max conversation turn due to users'
limited patience. In this section, we conduct experiments to explore the situation with a much
tighter cap on conversation turns. Specifically, we compare the performance of all the mentioned
methods under two smaller max conversation turns (7 and 10) on Yelp, LastFM and Kuaishou. Here, we
report the results for a max turn of 7 instead of 5. The reason is that several models,


such as EAR, ConUCB and Abs Greedy, perform terribly in the first five turns (see Figure 3),
making it hard to obtain meaningful interpretations there. The results at turn 7 better reveal
the differences between the compared methods.

Table 8. SR@7 and AT of the compared methods when the max conversation turn is 7. Bold scores denote the
best in each column, while underlined scores denote the best baseline. Statistical significance testing is
performed (𝑝 < 0.01).

                 Yelp            LastFM          Kuaishou
                 SR@7     AT     SR@7     AT     SR@7     AT
Abs Greedy 0.1798 6.6899 0.0046 6.983 0.0637 6.7793
ConUCB 0.1692 6.5184 0.0024 6.9876 0.1029 6.6072
EAR 0.2425 6.2305 0.0606 6.9215 0.0973 6.8606
ConTS-u𝑖𝑛𝑖𝑡 0.4784 5.9671 0.0600 6.8282 0.1915 6.3848
ConTS-P𝑢 0.1237 6.5986 0.0522 6.8612 0.1828 6.3893
ConTS-exp 0.2967 6.2155 0.0406 6.8979 0.0838 6.8746
ConTS 0.5508* 5.3995* 0.0752* 6.7512* 0.2012* 6.3610*

Table 9. SR@10 and AT of the compared methods when the max conversation turn is 10. Bold scores denote
the best in each column, while underlined scores denote the best baseline. Statistical significance testing is
performed (𝑝 < 0.01).

                 Yelp             LastFM           Kuaishou
                 SR@10    AT      SR@10    AT      SR@10    AT
Abs Greedy 0.2569 9.1584 0.0065 9.9649 0.0918 9.5315
ConUCB 0.2145 8.7965 0.0275 9.9228 0.1276 9.2240
EAR 0.3502 8.2744 0.0619 9.6966 0.1775 9.3712
ConTS-u𝑖𝑛𝑖𝑡 0.6269 7.1572 0.0964 9.5764 0.2918 8.6071
ConTS-P𝑢 0.2023 9.0030 0.0849 9.6435 0.2687 8.6696
ConTS-exp 0.3773 8.1632 0.0645 9.7279 0.1089 9.3899
ConTS 0.6609* 6.4655* 0.1112* 9.4549* 0.3168* 8.5216*

Table 8 and Table 9 report the results when the max turn is set to 7 and 10, respectively. It is clear
that ConTS still significantly outperforms all other models on both SR@7 (or SR@10) and
AT. This demonstrates that ConTS performs well even under a much smaller max
conversation turn.
Apart from that, it is interesting to see that the performance gap between ConTS and the other methods
becomes larger as the max conversation turn grows. We attribute this to two reasons. First,
since we make recommendations for cold-start users, all policies are somewhat random at the
beginning of each conversation session, which makes the differences less obvious when the max
conversation turn is small. Second, ConTS captures the user's preference better in each
conversation turn than the other methods, so its accumulated advantage grows as the
number of conversation turns increases.

5.7 Discussion: Effects of Cold-Start Degrees


In the above discussions, we have explored the cold-start case where historical user-item interaction
records are totally unavailable for new users. That is the typical scenario Thompson
Sampling is originally designed for [1, 14, 15, 31, 33]. However, in real applications, another common
cold-start case is that limited historical records are available. A natural question arises: how
does ConTS perform in such cases?
To answer this question, we test the performance of ConTS under different cold-start
degrees. Specifically, we randomly delete a ratio 𝜃 of the user-item interaction records of
each testing user and add the deleted records to the training and validation sets (distributed
with a 7:1.5 ratio). Therefore, 𝜃 quantifies the "cold-start degree". Then, we use the new training data to train the
offline FM model discussed in Section 4.3 and directly obtain the initial embeddings of the testing users
for the online updating (in contrast to the fully cold-start case, where we average the embeddings
of all existing users as the initialization). To make the discussion more interesting, we compare against the performance
of EAR under the same cold-start degrees; for EAR, we follow [27] to train the FM model on
the training set and train the policy network on the validation set. Note that EAR and ConTS use
the same function (Eq. (6)) to rank items; the difference is that ConTS uses Thompson Sampling to
achieve EE balance as a cold-start interaction strategy, while EAR employs a policy network trained
by reinforcement learning as a warm-start interaction strategy.
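A minimal sketch of this data protocol follows; the dictionary layout and function names are ours, and further distributing the moved records between training and validation (7:1.5) is left out for brevity:

```python
import random

def split_by_theta(test_user_records, theta, seed=0):
    """Move a ratio `theta` of each testing user's records into train/validation."""
    random.seed(seed)
    moved, kept = [], []
    for user, items in test_user_records.items():
        items = list(items)
        random.shuffle(items)
        k = int(theta * len(items))
        moved += [(user, v) for v in items[:k]]   # becomes observable history
        kept  += [(user, v) for v in items[k:]]   # still used to simulate sessions
    return moved, kept

# toy usage: a small theta keeps the user almost fully cold-start
moved, kept = split_by_theta({"u1": ["v1", "v2", "v3", "v4", "v5"]}, theta=0.4)
print(moved, kept)
```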
The experimental results are shown in Figure 4. The trend is consistent on all three
datasets. When very few historical records exist (e.g., 𝜃 < 0.1), ConTS consistently
achieves better performance on both Average Turn and SR@15. As more records become available,
the performance of EAR increases much more sharply than that of ConTS and eventually surpasses ConTS
by a large margin. This suggests that, in real applications, it is better to apply ConTS to
users with very few historical records, and to shift to warm-start solutions like EAR as
the records accumulate. However, to the best of our knowledge, little research has
conducted conclusive discussions on comparing bandit-based and warm-start recommendation
systems with respect to the degree of cold start. We hope our discussion can be a start that
triggers more exploration in the future.

5.8 Case Study


In this section, we perform three case studies (Figure 5) on real interaction records of Kuaishou users,
with one-attribute questions, for two baselines (ConUCB and EAR) and our proposed
ConTS. In case (a), the user wants a video of a famous beautiful female tennis player;
the video has two attributes: beautiful women and sports. As mentioned in Section 5.4, ConUCB
implements a handcrafted rule to decide whether to ask or recommend. From Figure 5(a), we can
see that ConUCB keeps asking attributes for the first four turns of the conversation. However,
the system has already learned all the user's preferred attributes after the first two
turns; it is clearly the perfect time to recommend items, as asking more questions brings no
additional benefit. This reveals the inflexibility of ConUCB in making conversation policies. EAR
decides the conversation policy based on the historical interaction records of existing users,
through a pre-trained FM model and a policy network. As a result, it tends to ask new users about
attributes that are popular in the training data, failing to explore the new user's interest. For example, in the
conversation in Figure 5(a), EAR insists on asking attributes with similar themes (such as "movies",
"animations" and "teleplays"). Although these attributes are popular among existing users, the new
user may not be interested in them at all. In contrast, ConTS can explore the new user's unknown
interests. In this conversation, since the user gives negative feedback on the attribute "arts", ConTS
learns to explore his interest in different attributes, like "sports". Once it captures the user's
preference for an attribute, it directly recommends items, aiming to successfully end the conversation
within few turns.


Fig. 4. SR@15 and AT of EAR and ConTS on Yelp, LastFM and Kuaishou with different cold-start degrees. 𝜃
quantifies the "cold-start degree".

Apart from the first case, where ConTS performs better than the baselines, we also
give two "bad" cases where ConTS is not the best. In Figure 5(b), only ConUCB successfully
completes the recommendation within the first five conversation turns. The user in this case prefers
a video about the development of Buddhism in China, which has the attribute religion. All
three methods have difficulty inferring the user's interest at the beginning, and the user gives
negative feedback to the first questions from all of them. However, ConUCB keeps asking
questions in the first four turns because its rule-based policy function forces it to do so, while EAR and
ConTS have less patience and turn to recommending items when the user gives negative feedback.
This indicates that ConUCB may perform better when the user's interest is hard to
infer, because a rule-based policy function can force it to ask more questions before
making recommendations. This observation is in line with the results in Table 4: ConUCB gets
better relative performance on LastFM, which is supposed to be the most difficult dataset (c.f. Table 4,
where all models consistently perform worst on LastFM). Although it is possible
for EAR and ConTS to learn to patiently ask more questions before making recommendations, the
process is less tractable as the policies are trained automatically.


(a) The user likes a video about a famous beautiful female tennis player.
(b) The user likes a video about the development of Buddhism in China.
(c) The user likes a video about a recently released movie.

Fig. 5. Sample conversations of ConUCB (left), EAR (middle) and ConTS (right) for three users.


The last case (Figure 5(c)) shows a user who likes a video introducing a recently released
movie; it has the attribute movie. As seen in the first case, movie is a quite popular attribute among
the existing users. Since EAR tends to ask new users about popular attributes, it quickly hits the
user's preferred attribute and then successfully recommends items within three conversation turns.
In contrast, ConTS and ConUCB tend to further explore the user's interest in less popular attributes,
such as "cars" or "history". This indicates that if the new user's preference is similar to the
average of existing users', EAR can be the more efficient method.

6 CONCLUSION AND FUTURE WORK


In summary, this paper focuses on the cold-start problem in conversational recommendation. We
customized contextual Thompson Sampling, a classic bandit algorithm, to the conversational scenario,
resulting in our ConTS model. The key contribution of ConTS is to seamlessly unify items and
attributes as undifferentiated arms in one space. As such, ConTS addresses the three conversation
policy questions (what attributes to ask, what items to recommend, and whether to ask attributes
or recommend items in a turn) as a single problem of arm choosing based on a unified reward
estimation function. With this simple strategy, ConTS nicely fits the conversational recommendation
problem into the framework of contextual Thompson Sampling, achieving a good EE balance and an
intelligent dialogue policy within a holistic model.
We designed a series of experiments on the Yelp, LastFM and Kuaishou datasets and empirically
validated the effectiveness of our proposed ConTS in various settings. First, the model outperforms
various strong baselines (e.g., Abs Greedy, EAR and ConUCB) both on success rate in nearly all turns
and on average turns for successful recommendation. Second, we validated the important components,
i.e., initialization, reward estimation and the exploration mechanism, by ablating each of them from
ConTS; the results demonstrate that each component does help to improve ConTS's
performance. We then discussed ConUCB in detail, as it also uses a bandit algorithm to model
items and attributes together but handles them separately. We found that separately modeling items
and attributes does not perform stably and can hardly capture the mutual promotion between attributes
and items. This further demonstrates the advantage of our strategy of seamlessly unifying items
and attributes in the same space. Furthermore, we conducted experiments with smaller max
conversation turns and found that ConTS still performs well under a tighter cap on conversation.
We also performed case studies to compare the behavior of the different methods.
There is still work to do in exploring conversational recommendation in the cold-start
scenario. First, we can improve the bandit algorithm by exploring different reward
estimation functions and more update strategies. Second, we can consider more
complex problem settings, for example, how to handle "don't know" or "don't
care" responses. Other types of responses are also possible, such as "I'd prefer X to Y", and some
systems would be expected to handle them (especially if they are presented in natural language); flexibly
handling this sort of response would be a very practical advance. Third, we can extend the current
framework in a neural fashion, leveraging the powerful modeling capability of neural networks
to capture more complex patterns of user preference.

7 SUPPLEMENTARY
In this supplementary, we illustrate intuitively how contextual Thompson Sampling helps to keep
the EE balance. As shown in Algorithm 1, we use a multidimensional Gaussian
distribution to describe each user's preference over arms, so the mean $\boldsymbol{\mu}_u$ and covariance $l^2 \mathbf{B}_u^{-1}$
determine the shape of the distribution. Each time the agent plays an arm $a(t)$ with
embedding $\mathbf{x}_{a(t)}$ and receives the feedback $r$, it first updates the covariance $l^2 \mathbf{B}_u^{-1}$ by:
$$\mathbf{B}_u = \mathbf{B}_u + \mathbf{x}_{a(t)} \mathbf{x}_{a(t)}^T. \qquad (21)$$
We focus on the diagonal elements of $\mathbf{B}_u$, which have the biggest impact on the sampling result.
Denoting these elements as a vector $\boldsymbol{\lambda}_B$, they are updated as:
$$\boldsymbol{\lambda}_B = \boldsymbol{\lambda}_B + \mathbf{x}_{a(t)}^2. \qquad (22)$$
Apparently, the diagonal elements of $\mathbf{B}_u$ increase after each action (asking an
attribute or recommending items). Since the covariance matrix is $l^2 \mathbf{B}_u^{-1}$, its diagonal elements
decrease instead, which means our uncertainty in estimating the user's preference on those arms
is also reduced. The next time we sample, the result will be closer to the mean
$\boldsymbol{\mu}_u$ for the played arms because of the smaller values in the
covariance matrix. This tells the system to do less exploration on arms for which we already
know the user's preference.
As for the mean $\boldsymbol{\mu}_u$, we update it in this way:
$$\mathbf{B}_u = \mathbf{B}_u + \mathbf{x}_{a(t)} \mathbf{x}_{a(t)}^T \qquad (23)$$
$$\mathbf{f}_u = \mathbf{f}_u + r'_a \cdot \mathbf{x}_{a(t)} \qquad (24)$$
$$\boldsymbol{\mu}_u = \mathbf{B}_u^{-1} \mathbf{f}_u. \qquad (25)$$
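A compact sketch of this update (Eqs. (23)-(25)) and the sampling step follows; the inputs are illustrative and the variable names are ours:

```python
import numpy as np

d, l = 64, 0.01  # embedding dimension and scale, following the paper's settings

def posterior_update(B, f, x_a, r_dev):
    """Eqs. (23)-(25): fold the played arm into B_u and f_u, then recompute mu_u."""
    B = B + np.outer(x_a, x_a)    # diagonal grows, so covariance l^2 B^{-1} shrinks
    f = f + r_dev * x_a           # r_dev is the de-biased feedback r'_a
    mu = np.linalg.solve(B, f)    # mu_u = B_u^{-1} f_u
    return B, f, mu

def sample_user(mu, B):
    """Draw u ~ N(mu_u, l^2 B_u^{-1}); smaller covariance means less exploration."""
    cov = l ** 2 * np.linalg.inv(B)
    return np.random.multivariate_normal(mu, cov)

# toy usage: one positive-feedback update followed by a sampling step
B, f = np.eye(d), np.zeros(d)
rng = np.random.default_rng(0)
x = rng.normal(size=d)
B, f, mu = posterior_update(B, f, x, r_dev=5.0)
u_sampled = sample_user(mu, B)
```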
Consider the formula separately. First we fix $\mathbf{B}_u$ and investigate the impact of $\mathbf{f}_u$. In that case,
after getting the feedback of arm $a(t)$ and updating $\boldsymbol{\mu}_u$, the agent will calculate the reward of the
played arm $a(t)$ next time as follows; the expectation of the newly estimated reward is:
$$\begin{aligned}
\mathbb{E}[reward_{new}] &= \mathbf{x}_{a(t)}^T \boldsymbol{\mu}_{new} = \mathbf{x}_{a(t)}^T \mathbf{B}_u^{-1} \mathbf{f}_{new} \\
&= \mathbf{x}_{a(t)}^T \mathbf{B}_u^{-1} (\mathbf{f}_{old} + \mathbf{x}_{a(t)} \cdot r'_a) \\
&= \mathbf{x}_{a(t)}^T \mathbf{B}_u^{-1} \mathbf{f}_{old} + \mathbf{x}_{a(t)}^T \mathbf{B}_u^{-1} \mathbf{x}_{a(t)} \cdot r'_a \\
&= \mathbb{E}[reward_{old}] + \mathbf{x}_{a(t)}^T \mathbf{B}_u^{-1} \mathbf{x}_{a(t)} \cdot r'_a.
\end{aligned} \qquad (26)$$

Since $\mathbf{B}_u^{-1}$ is a positive semidefinite matrix, $\mathbf{x}_{a(t)}^T \mathbf{B}_u^{-1} \mathbf{x}_{a(t)}$ is non-negative. It is easy to see
that the change in the expected reward is decided by the de-biased user feedback $r'_a$. For example, if
a user gives positive feedback on an arm $a(t)$, the algorithm will next time tend to score higher for
$a(t)$ as well as for arms with similar embeddings; if the feedback is negative, the
score will be relatively smaller. This strategy helps the algorithm exploit by remembering
past experience and acting accordingly. As for the second part, $\mathbf{B}_u^{-1}$, in the update formula,
we take it as a regularization term that confines the updating of $\boldsymbol{\mu}_u$ (through the decrease of the diagonal
elements of $\mathbf{B}_u^{-1}$).
We also note that, since we regard all attributes and items as equivalent arms, the
expected rewards of asking attributes and recommending items are estimated by updating the
parameters in the same way, which enables the algorithm to choose actions intelligently according
to the situation: the model learns to identify the benefit of taking different actions and
makes the best choice accordingly.


REFERENCES
[1] Shipra Agrawal and Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. In ICML.
127–135.
[2] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2019. Asking Clarifying Questions in
Open-Domain Information-Seeking Conversations. In SIGIR. 475–484.
[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem.
Machine learning 47, 2-3 (2002), 235–256.
[4] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Mach.
Learn. 47, 2-3 (May 2002), 235–256. https://doi.org/10.1023/A:1013689704352
[5] Olga Averjanova, Francesco Ricci, and Quang Nhat Nguyen. 2008. Map-based interaction with a conversational mobile
recommender system. In 2008 The Second International Conference on Mobile Ubiquitous Computing, Systems, Services
and Technologies. IEEE, 212–218.
[6] Olivier Chapelle and Lihong Li. 2011. An empirical evaluation of thompson sampling. In NIPS. 2249–2257.
[7] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative
Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In SIGIR. 335–344.
[8] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards
Knowledge-Based Recommender Dialog System. In EMNLP-IJCNLP. 1803–1813.
[9] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg
Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS@RecSys. 7–10.
[10] Konstantina Christakopoulou, Alex Beutel, Rui Li, Sagar Jain, and Ed H Chi. 2018. Q&R: A Two-Stage Approach
toward Interactive Recommendation. In SIGKDD. 139–148.
[11] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conversational recommender
systems. In KDD. 815–824.
[12] Linus W Dietz, Saadi Myftija, and Wolfgang Wörndl. 2019. Designing a conversational travel recommender system
based on data-driven destination characterization. In ACM RecSys workshop on recommenders in tourism. 17–21.
[13] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Evans Etrue, and Giovanni Zappella. 2016. On
context-dependent clustering of bandits. In ICML. 1253–1262.
[14] Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale bayesian click-
through rate prediction for sponsored search advertising in microsoft’s bing search engine. In ICML.
[15] Ole-Christoffer Granmo. 2010. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton.
International Journal of Intelligent Computing and Cybernetics 3, 2 (2010), 207–234.
[16] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine
based neural network for CTR prediction. IJCAI (2017).
[17] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In SIGIR. 355–364.
[18] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and
Powering Graph Convolution Network for Recommendation. In SIGIR.
[19] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering.
In WWW. 173–182.
[20] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recom-
mendation with implicit feedback. In SIGIR. 549–558.
[21] Zeng-Wei Hong, Rui-Tang Huang, Kai-Yi Chin, Chia-Chi Yen, and Jim-Min Lin. 2010. An interactive agent system for
supporting knowledge-based recommendation: a case study on an e-novel recommender system. In Proceedings of the
4th International Conference on Uniquitous Information Management and Communication. 1–8.
[22] Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2020. A Survey on Conversational Recommender
Systems. CoRR (2020).
[23] Shaojie Jiang, Pengjie Ren, Christof Monz, and Maarten de Rijke. 2019. Improving Neural Response Diversity with
Frequency-Aware Cross-Entropy Loss. In The Web Conference 2019. ACM, 2879–2885.
[24] Xisen Jin, Wenqiang Lei, Zhaochun Ren, Hongshen Chen, Shangsong Liang, Yihong Zhao, and Dawei Yin. 2018. Explicit
State Tracking with Semi-Supervisionfor Neural Dialogue Generation. In CIKM. ACM, 1403–1412.
[25] Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems.
IEEE Computer (2009), 30–37.
[26] Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied
mathematics 6, 1 (1985), 4–22.
[27] Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. 2020.
Estimation–Action–Reflection: Towards Deep Interaction Between Conversational and Recommender Systems. In
WSDM.


[28] Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying
Task-oriented Dialogue Systems with Single Sequence-to-Sequence Architectures. In ACL. 1437–1447.
[29] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news
article recommendation. In WWW. 661–670.
[30] Seth Siyuan Li and Elena Karahanna. 2015. Online recommendation systems in a B2C E-commerce context: a review
and future directions. Journal of the Association for Information Systems 16, 2 (2015), 72.
[31] Benedict C May and David S Leslie. 2011. Simulation studies in optimistic Bayesian sampling in contextual-bandit
problems. Statistics Group, Department of Mathematics, University of Bristol 11, 02 (2011).
[32] Kevin McCarthy, James Reilly, Lorraine McGinty, and Barry Smyth. 2004. On the dynamic generation of compound
critiques in conversational recommender systems. In International Conference on Adaptive Hypermedia and Adaptive
Web-Based Systems. Springer, 176–184.
[33] Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. MIT press.
[34] Ian Osband and Benjamin Van Roy. 2017. On optimistic versus randomized exploration in reinforcement learning.
arXiv preprint arXiv:1706.04241 (2017).
[35] Bilih Priyogi. 2019. Preference Elicitation Strategy for Conversational Recommender System. In WSDM. ACM, 824–825.
[36] Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. 2019. Coached Conversational Preference
Elicitation: A Case Study in Understanding Movie Preferences. In Proceedings of the 20th Annual SIGdial Meeting on
Discourse and Dialogue. 353–360.
[37] Pengjie Ren, Zhumin Chen, Christof Monz, Jun Ma, and Maarten de Rijke. 2020. Thinking Globally, Acting Locally:
Distantly Supervised Global-to-Local Knowledge Selection for Background Based Conversation. In Thirty-Fourth AAAI
Conference on Artificial Intelligence (AAAI-20). AAAI.
[38] Steffen Rendle. 2010. Factorization machines. In ICDM. 995–1000.
[39] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized
ranking from implicit feedback. In UAI.
[40] Daniel Russo and Benjamin Van Roy. 2014. Learning to optimize via posterior sampling. Mathematics of Operations
Research (2014), 1221–1243.
[41] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. A tutorial on thompson
sampling. Foundations and Trends® in Machine Learning (2018), 1–96.
[42] Nicola Sardella, Claudio Biancalana, Alessandro Micarelli, and Giuseppe Sansonetti. 2019. An Approach to Conversa-
tional Recommendation of Restaurants. In ICHCI. Springer, 123–130.
[43] Hideo Shimazu. 2002. ExpertClerk: A Conversational Case-Based Reasoning Tool for Developing Salesclerk Agents in
E-Commerce Webshops. Artificial Intelligence Review 18, 3-4 (2002), 223–244.
[44] Yueming Sun and Yi Zhang. 2018. Conversational recommender system. In SIGIR. 235–244.
[45] Cynthia A Thompson, Mehmet H Goker, and Pat Langley. 2004. A personalized system for conversational recommen-
dations. Journal of Artificial Intelligence Research 21 (2004), 393–428.
[46] Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. In KDD.
[47] Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization Bandits for Interactive Recommendation.. In
AAAI. 2695–2702.
[48] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In
SIGIR. 165–174.
[49] Bifan Wei, Jun Liu, Qinghua Zheng, Wei Zhang, Chenchen Wang, and Bei Wu. 2015. DF-Miner: Domain-specific facet
mining by leveraging the hyperlink structure of Wikipedia. Knowledge-Based Systems 77 (2015), 80–91.
[50] Qingyun Wu, Naveen Iyer, and Hongning Wang. 2018. Learning Contextual Bandits in a Non-stationary Environment.
In SIGIR. 495–504.
[51] Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. 2016. Contextual bandits in a collaborative
environment. In SIGIR. 529–538.
[52] Tong Yu, Yilin Shen, and Hongxia Jin. 2019. A Visual Dialog Augmented Interactive Recommender System. In
SIGKDD. ACM, 157–165.
[53] Hamed Zamani, Susan T. Dumais, Nick Craswell, Paul N. Bennett, and Gord Lueck. 2020. Generating Clarifying
Questions for Information Retrieval. In WWW.
[54] Xiaoying Zhang, Hong Xie, Hang Li, and John Lui. 2020. Conversational Contextual Bandit: Algorithm and Application.
In WWW.
[55] Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W Bruce Croft. 2018. Towards conversational search and
recommendation: System ask, user respond. In CIKM. 177–186.
