

When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities

Jin Chen∗, Zheng Liu∗, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, Defu Lian and Enhong Chen

arXiv:2307.16376v1 [cs.IR] 31 Jul 2023

Abstract—The advent of large language models marks a revolutionary breakthrough in artificial intelligence. With the unprecedented scale of training data and model parameters, the capability of large language models has improved dramatically, leading to human-like performance in understanding, language synthesis, and common-sense reasoning. Such a major leap forward in general AI capability will fundamentally change how personalization is conducted. For one thing, it will reform the way humans interact with personalization systems. Instead of being a passive medium of information filtering, like conventional recommender systems and search engines, large language models provide the foundation for active user engagement. On top of such a new foundation, users' requests can be proactively explored, and the information users need can be delivered in a natural, interactive, and explainable way. For another thing, it will also considerably expand the scope of personalization, from the sole function of collecting personalized information to the compound function of providing personalized services. By leveraging large language models as a general-purpose interface, personalization systems may compile users' requests into plans, call the functions of external tools (e.g., search engines, calculators, service APIs) to execute those plans, and integrate the tools' outputs to complete end-to-end personalization tasks. Today, large language models are still developing rapidly, whereas their application to personalization remains largely unexplored. We therefore consider it the right time to review the challenges in personalization and the opportunities to address them with large language models. In particular, we dedicate this perspective paper to the discussion of the following aspects: the development of and challenges for existing personalization systems, the newly emerged capabilities of large language models, and the potential ways of making use of large language models for personalization.

Index Terms—Large Language Models, Personalization Systems, Recommender Systems, Tool-learning, AIGC

• ∗ Equal Contribution
• X. Huang, C. Wu, Q. Liu, G. Jiang, Y. Pu, Y. Lei, X. Chen, X. Wang, D. Lian and E. Chen are with the University of Science and Technology of China. Email: {xuhuangcs, wcw1996, qiliu67, gwjiang, puyuanhao, lyx180812, chenxiaolong, xingmeiwang}@mail.ustc.edu.cn and {liandefu, cheneh}@ustc.edu.cn.
• Z. Liu is with Huawei Technologies Ltd. Co. E-mail: [email protected].
• J. Chen is with the University of Electronic Science and Technology of China. E-mail: [email protected].
Corresponding author: Defu Lian (E-mail: [email protected])

1 INTRODUCTION

The emergence of large language models [1], which have demonstrated remarkable progress in understanding human expression, is profoundly impacting the AI community. These models, equipped with vast amounts of data and large-scale neural networks, exhibit impressive capabilities in comprehending human language and generating text that closely resembles our own. Among these abilities are reasoning [2], few-shot learning [3], and the incorporation of extensive world knowledge within pre-trained models [1]. This marks a significant breakthrough in the field of artificial intelligence, leading to a revolution in our interactions with machines. Consequently, large language models have become indispensable across various applications, ranging from natural language processing and machine translation to creative content generation and chatbot development.

The introduction of ChatGPT, in particular, has gained significant attention from the human community, prompting reflections on the transformative power of large language models and their potential to push the boundaries of what AI can achieve. This disruptive technology holds the promise of transforming how we interact with and leverage AI in countless domains, opening up new possibilities and opportunities for innovation. As these language models continue to advance and evolve, they are likely to shape the future of artificial intelligence, empowering us to explore uncharted territories and unlock even greater potential in human-machine collaboration.
Personalization, the art of tailoring experiences to individual preferences, stands as an essential and dynamic connection that bridges the gap between humans and machines. In today's technologically driven world, personalization plays a pivotal role in enhancing user interactions and engagement with a diverse array of digital platforms and services. By adapting to individual preferences, personalization systems empower machines to cater to each user's unique needs, leading to more efficient and enjoyable interactions. Moreover, personalization goes beyond mere content recommendations; it encompasses various facets of the user experience, including user interfaces, communication styles, and more. As artificial intelligence continues to advance, personalization becomes increasingly sophisticated in handling large volumes of interactions and diverse user intents. This calls for the development of more advanced techniques to tackle complex scenarios and provide even more enjoyable and satisfying experiences. The pursuit of improved personalization is driven by the desire to better understand users and cater to their ever-evolving needs. As technology evolves, personalization systems will likely continue to evolve as well, ultimately creating a future where human-machine interactions are seamlessly integrated into every aspect of our lives, offering personalized and tailored experiences that enrich our daily routines.

Large language models, with their deep and broad capabilities, have the potential to revolutionize personalization systems, transforming the way humans interact and expanding the scope of personalization. The interaction between humans and machines can no longer be simply classified as active or passive, as with traditional search engines and recommendation systems: large language models go beyond simple information filtering and offer a diverse array of additional functionalities. Specifically, user intent can be actively and comprehensively explored, allowing for more direct and seamless communication between users and systems through natural language. Unlike traditional technologies that relied on abstract and less interpretable ID-based information representations, large language models enable a more profound understanding of users' actual demands and interests. This deeper comprehension paves the way for higher-quality personalized services, meeting users' needs and preferences in a more refined and effective manner. Moreover, the integration of various tools is greatly enhanced by the capabilities of large language models, significantly broadening the possibilities and scenarios for personalized systems. By transforming user requirements into plans, including understanding, generating, and executing them, users can access a diverse range of information and services. Importantly, users remain unaware of the intricate and complex transformations happening behind the scenes, as they experience a seamless end-to-end model. From this point of view, the potential of large language models in personalization is largely unexplored.

This paper addresses the challenges in personalization and explores potential solutions using large language models. In existing related work, LaMP [4] introduces a novel benchmark for training and evaluating language models in producing personalized outputs for information retrieval systems. Other related surveys [5], [6], [7] focus mainly on traditional personalization techniques, such as recommender systems. From the perspective of learning mechanisms, LLM4Rec [5] delves into both Discriminative LLMs for Recommendation and Generative LLMs for Recommendation. Regarding the adaptation of LLMs for recommender systems in terms of 'Where' and 'How', Li et al. [6] concentrate on the overall pipeline of industrial recommender phases. Fan et al. [7], on the other hand, conduct a review with a focus on pre-training, fine-tuning, and prompting approaches. While these works discuss pre-trained language models like BERT and GPT for ease of analysis, they dedicate limited attention to the emergent capabilities of large language models. This paper aims to fill this gap by examining the unique and powerful abilities of large language models in the context of personalization, and further expands the scope of personalization with tools.

The remainder of this survey is organized as follows. We review personalization and large language models in Section 2 to give an overview of their development and challenges. We then carefully discuss the potential roles of large language models for personalization from Section 3 onward, moving from the simple utilization of emergent capabilities to the complex integration with other tools. We also discuss the potential challenges when large language models are adapted for personalization.

2 BACKGROUND OVERVIEW

2.1 Personalization Techniques

Personalization, a nuanced art that tailors experiences to the unique preferences and needs of individual users, has become a cornerstone of modern artificial intelligence. In this section, we explore the captivating world of personalization techniques and their profound impact on user interactions with AI systems. We will delve into three key aspects of personalization: recommender systems, personalized assistance, and personalized search. These techniques not only enhance user satisfaction but also exemplify the evolution of AI, where machines seamlessly integrate with our lives, understanding us on a profound level. By tailoring recommendations, providing customized assistance, and delivering personalized search results, AI systems have the potential to create a truly immersive and individualized user experience.

2.1.1 Recommender Systems

Recommender systems play a pivotal role in personalization, revolutionizing the way users discover and engage with content. These systems aim to predict and suggest items of interest to individual users, such as movies, products, or articles, based on their historical interactions and preferences.

Recommender systems have evolved significantly over the years, with collaborative filtering [8], [9] being one of the earliest and most influential approaches. Collaborative filtering relies on user-item interaction data to identify patterns and make recommendations based on users with similar preferences. Traditional solutions, such as matrix factorization [10] and user/item-based approaches [11], surface potentially interesting items based on the idea that users who have shown similar preferences in the past are likely to have similar preferences in the future. While effective, collaborative filtering has limitations, such as the "cold start" problem for new users and items. To address these limitations, content-based filtering [12] emerged, which considers the content of items to make recommendations. It leverages the features and attributes of items to find similarities and make personalized suggestions. These features can be grouped into user-side information, such as user profiles; item-side information [13], [14], such as item brands and item categories; and interaction-based information [15], such as reviews and comments. However, content-based filtering may struggle to capture complex user preferences and to discover diverse recommendations, being restricted by the limited feature representations.
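To make the collaborative filtering idea above concrete, the following minimal sketch factorizes a toy user-item rating matrix with stochastic gradient descent. The data, latent dimension, and hyperparameters are illustrative assumptions, not settings taken from any of the cited systems.

```python
import numpy as np

# Toy user-item rating matrix (0 = unobserved); purely illustrative data.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k = 2                              # latent dimension (assumed)
lr, reg, epochs = 0.01, 0.02, 2000

rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))   # user factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors

observed = [(u, i) for u in range(n_users) for i in range(n_items) if R[u, i] > 0]

for _ in range(epochs):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]               # prediction error on one rating
        P[u] += lr * (err * Q[i] - reg * P[u])    # SGD updates with L2 regularization
        Q[i] += lr * (err * P[u] - reg * Q[i])

# Predicted scores for unobserved entries can then be ranked per user.
print(np.round(P @ Q.T, 2))
```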
In recent years, deep learning has gained significant attention in the field of recommender systems due to its ability to model complex patterns and interactions in user-item data [16]. Deep learning-based methods have shown promising results in capturing sequential, temporal, and contextual information, as well as in extracting meaningful representations from large-scale data. With the introduction of deep networks, high-order interactions between the features of users and items are well captured to extract user interest. Deep learning-based methods capture such high-order interactions by employing techniques like attention mechanisms [17], [18] and graph-based networks [19] to mine complex relationships between users and items. These methods have been shown to enhance recommendation performance by considering higher-order dependencies and inter-item relationships. Another area of deep learning-based recommender systems is sequential recommenders, specifically designed to handle sequential user-item interactions, such as user behavior sequences over time. Self-attention [20] and Gated Recurrent Units (GRUs) [21] are popular choices for modeling sequential data in recommender systems. These models excel at capturing temporal dependencies and context, making them well-suited for tasks like next-item recommendation and session-based recommendation. Sequential models can take into account the order in which items are interacted with and learn patterns of user behavior that evolve over time. Furthermore, the rise of language models like BERT has further advanced recommender systems by enabling a better understanding of both natural language features and sequential user behaviors [22]. These language models can capture deep semantic representations and world knowledge, enriching the recommendation process and facilitating more personalized and context-aware recommendations. Overall, the application of deep learning techniques in recommender systems has opened new avenues for research and innovation, promising to revolutionize the field of personalized recommendations and enhance user experiences.

2.1.2 Personalized Assistance

Personalized assistance refers to the use of artificial intelligence and machine learning techniques to tailor and customize experiences, products, or content based on the individual preferences, behavior, and characteristics of users. By analyzing these preferences, behaviors, and characteristics, it creates a personalized ecosystem that enhances user engagement and satisfaction. In contrast to traditional recommender systems, which rely on predicting user interests passively, personalized assistance takes a more proactive approach. It ventures into the realm of predicting users' next intentions or actions by utilizing contextual information, such as historical instructions and speech signals. This deeper level of understanding enables the system to cater to users' needs in a more anticipatory and intuitive manner. At the core of this capability lies the incorporation of cutting-edge technologies like natural language processing (NLP) and computer vision. These advanced tools empower the system to recognize and interpret user intentions, whether conveyed through spoken or written language, or even visual cues. Moreover, the potential of personalized assistance extends beyond static recommendations to dynamic and context-aware interactions. As the system becomes more familiar with a user's preferences and patterns, it adapts and refines its recommendations in real time, keeping pace with the ever-changing needs and preferences of the user.

Conversational recommender systems mark a remarkable stride forward in the realm of personalized assistance. By engaging users in interactive conversations, these systems delve deeper into their preferences and fine-tune their recommendations accordingly. Leveraging the power of natural language understanding, these conversational recommenders adeptly interpret user queries and responses, culminating in a seamless and engaging user experience. Notable instances of personalized assistance products, such as Siri and Microsoft Cortana, have already proven their effectiveness on mobile devices. Additionally, the integration of large language models like ChatGPT further elevates the capabilities of conversational recommenders, promising even more enhanced user experiences. As this technology continues to progress, we can anticipate its growing significance across diverse industries, including healthcare, education, finance, and entertainment. While the growth of conversational recommenders and personalized assistance promises immense benefits, it is imperative to develop these products responsibly. Upholding user privacy and ensuring transparent data handling practices are essential to maintain user trust and safeguard sensitive information.

2.2 Large Language Models

Language models perform probabilistic modeling of natural language generation, i.e., presented with a specific context, a language model predicts the words to be generated in future steps. Nowadays, language models are mostly built upon deep neural networks, where two features need to be emphasized. First, the majority of language models are based on transformers or their close variants [23]. Such neural networks are proficient at modeling context dependency within natural language and exhibit superior, consistently improved performance when scaled up. Second, the language models are pre-trained at scale with a massive amount of unlabeled corpus. The pre-trained models are further fine-tuned with task-oriented data so as to adapt to different downstream applications.

There has been tremendous progress in language models in recent years, where the emergence of large language models, represented by GPT-3, marks an important milestone for the entire AI community. Large language models (LLMs), as the name suggests, are massively scaled-up derivatives of conventional language models. In particular, the backbone networks and the training data have been largely magnified. For one thing, although there is no specific criterion for the minimum size, a typical LLM usually consists of no less than several billion and up to trillions of model parameters, which is orders of magnitude larger than before. For another thing, the pre-training is conducted on much larger unsupervised corpora, with hundreds of billions or trillions of tokens carefully filtered from sources like Common Crawl, GitHub, Wikipedia, Books, arXiv, etc. The impact of scaling is illustrated by the scaling laws [24], [25], which numerically uncover the power-law relationship between model size, data volume, training scale and the growth of the model's performance.
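As a rough illustration of the power-law form reported in [24], with the caveat that the exponents depend on the experimental setup and are quoted here only approximately, the test loss as a function of non-embedding parameters N, dataset size D, and training compute C behaves like

L(N) \approx (N_c / N)^{\alpha_N}, \quad L(D) \approx (D_c / D)^{\alpha_D}, \quad L(C) \approx (C_c / C)^{\alpha_C},

with fitted reference scales N_c, D_c, C_c and exponents roughly \alpha_N \approx 0.076, \alpha_D \approx 0.095, \alpha_C \approx 0.05.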
The scaling up of networks and training data leads to a leap forward in large language models' capability. They not only become more proficient at conventional skills, like understanding people's intent and synthesizing human-like language, but also possess capabilities that are rarely exhibited by smaller models. Such a phenomenon is referred to as the emergent abilities of LLMs, where three representative capabilities are frequently discussed. One is the in-context learning capability, where LLMs may quickly learn from the few-shot examples provided in the prompt. Another is the instruction-following capability: after being fine-tuned with diversified tasks in the form of instruction tuning, LLMs become proficient at following human instructions and may thus handle different tasks presented in an ad-hoc manner. Last but not least, LLMs are found to be able to conduct step-by-step reasoning. With certain types of prompting strategies, like Chain-of-Thought (CoT), LLMs may iteratively approach the final answer of some complex tasks, like mathematical word problems, by breaking the tasks down into sub-problems and figuring out plausible intermediate answers for each of the sub-problems.

Thanks to their superior capabilities in understanding, reasoning, and generation, large language models, especially the chat models produced by instruction tuning, are presented as fundamental building blocks for many personalization services. One direct scenario is conversational search and recommendation. Once built upon large language models, search and recommendation systems will be able to engage with users via interactions, present outputs in a verbalized and explainable way, receive feedback from the user, and make adjustments on top of that feedback. The above changes will bring about a paradigm shift for personalization services, from passively making search and recommendation to proactively figuring out users' needs and seeking users' preferred items. In broader scopes, LLMs may go beyond simply making personalized search and recommendation and instead act as personalized assistants that help users with their task completion. LLMs may take notes of users' important information within their memory, make personalized plans based on the memorized information when new demands are raised, and execute the plans by leveraging tools like search engines and recommendation systems.

Yet, we have to confront the reality that applying LLMs for personalization is not a trivial problem. To name a few of the open challenges: firstly, personalization calls for the understanding of user preference, which is more a matter of domain-specific knowledge than of the common-sense knowledge learned by LLMs. The effective and efficient adaptation of LLMs for personalized services remains to be resolved. Besides, LLMs could memorize users' confidential information while providing personalized services, which raises concerns about privacy protection. Moreover, LLMs are learned from Internet data; due to exposure bias, it is almost inevitable that they make unfair predictions for minorities. To address the above challenges, benchmarks and evaluation datasets are needed by the research communities. However, such resources are far from complete at present. To fully support personalization with LLMs, methodological and experimental frameworks need to be systematically established for all these perspectives.

3 LLMS FOR PERSONALIZATION

In the following sections, we delve into the potential of large language models for personalization, examining their evolution from simple use cases, like utilizing world knowledge as features, to more intricate integration with other tool modules to act as agents. Specifically, we focus on the progression of emergent capabilities, starting from basic world knowledge and understanding user intent, and advancing to high-level reasoning abilities. We explore how large language models can contribute to constructing a knowledge base that enriches common-sense knowledge about various items. Additionally, we discuss how the understanding capability of large language models can empower content interpreters and explainers for in-depth analysis of interactions. Furthermore, we observe attempts to leverage the reasoning ability of large language models as system reasoners to provide recommendation results. These increasingly sophisticated capabilities enable the complex utilization of large language models with other tool modules, allowing them to better comprehend user intentions and fulfill user instructions. Consequently, we also explore the integration of large language models with other tools for personalization, including tool learning, conversational agents and personalized content creators. The overview of this chapter is depicted in Figure 1. Our comprehensive survey aims to provide a deeper understanding of the current landscape, shedding light on the opportunities and challenges associated with incorporating large language models into personalization.

4 LLMS AS KNOWLEDGE BASE

Knowledge bases provide rich information with semantics, attracting increasing attention for their usage in recommender systems. In particular, knowledge graphs, where nodes represent entities and edges represent relations in a heterogeneous information graph, are the common format of knowledge bases and are introduced as side information to enhance the performance of recommenders. Knowledge graphs help in understanding the mutual relations between users and items and also provide better explainability for recommenders. Existing methods that incorporate knowledge graphs in recommender systems can be classified into three main groups: embedding-based methods, path-based methods and unified methods. Embedding-based methods, such as CKE [14], DKN [26], KSR [27] and SHINE [28], utilize semantic representations of users and items. These methods aim to capture the underlying semantic relationships between entities in the knowledge graph, which can improve the quality of recommendations. Path-based approaches, such as Hete-MF [29], SemRec [30], RuleRec [31] and EIUM [32], exploit the semantic connectivity information present in the knowledge graph to regularize the user and item representations. These methods consider the paths between users and items in the graph and leverage them to incorporate interpretability into the recommendation process. Unified methods, such as RippleNet [33], KGCN [34], KGAT [35], AKUPM [36] and IntentGC [37], refine the representations of entities in the knowledge graph by leveraging embedding propagation techniques. These methods propagate the embeddings of entities through the graph structure, allowing information to flow across connected entities and refining the representations accordingly.
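The embedding-propagation idea behind these unified methods can be sketched as repeated neighborhood aggregation over the knowledge graph. The snippet below is a simplified, framework-free illustration: the random toy embeddings, the mean aggregator, and the two propagation layers are assumptions for exposition, not the exact formulation of any cited model.

```python
import numpy as np

n_entities, dim = 6, 4
rng = np.random.default_rng(42)
E = rng.standard_normal((n_entities, dim))      # initial entity embeddings (toy values)

# Knowledge-graph edges as (head, tail) pairs; relation types are ignored for simplicity.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]

# Symmetric adjacency with self-loops, row-normalized for mean aggregation.
A = np.eye(n_entities)
for h, t in edges:
    A[h, t] = A[t, h] = 1.0
A = A / A.sum(axis=1, keepdims=True)

W = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(2)]  # per-layer transforms

H = E
for layer in range(2):
    # Each entity mixes in its neighbors' embeddings, then applies a transform + ReLU.
    H = np.maximum(A @ H @ W[layer], 0.0)

# A user-item relevance score can then be read off as an inner product of refined embeddings.
user, item = 0, 3
print(float(H[user] @ H[item]))
```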

Fig. 1. The Overview of LLM for Personalization

However, the knowledge graphs adopted in recommender systems are limited and of low usability. Reviewing the various knowledge graph datasets for recommender systems, which cover domains such as movies, books, news and products, these datasets are still significantly sparse compared to the vast amount of human knowledge, particularly lacking in facts, because of the expensive supervision required to construct a knowledge graph. Building a comprehensive and accurate knowledge graph is a complex and resource-intensive task, involving data collection, integration, and cleaning to assure data quality and consistency. Limited by the expensive cost of labelling knowledge graphs, there usually exist missing entities or relations. The user preferences associated with these entities or paths may therefore be ignored, and recommendation performance suffers.

The ability of large language models to retrieve factual knowledge as explicit knowledge bases [38], [39], [40], [41], [42], [43], [44], [45], [46] has been widely discussed, which presents an opportunity to construct more comprehensive knowledge graphs within recommender systems. Tracing back to the work of [38], large language models have shown impressive power in storing factual information, such as entities and common sense, and this commonsense knowledge can be reliably transferred to downstream tasks.

Existing methods for knowledge graphs fall short in handling incomplete KGs [47] and in constructing KGs from text corpora [48], and many researchers attempt to leverage the power of LLMs to solve these two tasks, i.e., knowledge completion [49] and knowledge construction [50]. For knowledge graph completion, which refers to the task of filling in missing facts in a given knowledge graph, recent efforts encode text or generate facts for knowledge graphs. MTL-KGC [51] encodes text sequences to predict the plausibility of triples. MEMKGC [52] predicts the masked entities of a triple. StAR [53] utilizes Siamese textual encoders to separately encode the entities. GenKGC [54] uses decoder-only language models to directly generate the tail entity. TagReal [55] generates high-quality prompts from external text corpora. AutoKG [48] directly adopts LLMs, such as ChatGPT and GPT-4, and designs tailored prompts to predict the tail entity. As for the other important task, knowledge graph construction, which refers to creating a structured representation of knowledge, LLMs can be applied throughout the construction process, including entity discovery [56], [57], coreference resolution [58], [59] and relation extraction [60], [61]. LLMs can also achieve end-to-end construction [62], [50], [42], [63], [55] to directly build KGs from raw text. LLMs further enable knowledge distillation for constructing knowledge graphs: symbolic-kg [64] distills commonsense facts from GPT-3 and then fine-tunes a small student model to generate knowledge graphs. These models have demonstrated the capacity to store large volumes of knowledge, providing a viable option for improving the scope and depth of knowledge graphs. Furthermore, these advancements have prompted research into the direct transfer of stored knowledge from LLMs to knowledge graphs, eliminating the need for human supervision. This line of research sheds light on the possibility of automating knowledge graph completion with cutting-edge large language models.

By leveraging the capabilities of LLMs, recommender systems would benefit from a more extensive and up-to-date knowledge base. Firstly, missing factual information can be completed to construct more extensive knowledge graphs, and thus the relations between entities can be extracted for better recommenders. Secondly, in contrast to the previously exclusive reliance on in-domain data, the large language model itself contains plenty of cross-domain information that can help achieve cross-domain recommendations, such as recommending appropriate movies based on the user's favorite songs. To sum up, the stored knowledge can be utilized to enhance recommendation accuracy, relevance, and personalization, ultimately improving the overall performance of recommender systems. Existing work [65] prompts large language models to generate factual knowledge about movies to enhance the performance of CTR prediction models. To better utilize the factual knowledge, a Knowledge Adaptation module is adopted for better contextual information extraction.
It is worth noting that the hallucination problem of large language models can be a challenge when they are applied to recommendation tasks. The inherent nature of large language models can introduce ambiguity or inaccurate provenance [66]. This issue can manifest as the introduction of extraneous information or even noise into the recommendation process. Large language models may generate responses that, while syntactically correct, lack informative context or relevance. According to KoLA [67], a benchmark for evaluating the world knowledge of LLMs, even the top-ranked GPT-4 achieves only 0.012 in Precision and 0.013 in Recall on the Named Entity Recognition task, which falls far short of the performance (0.712 in Precision and 0.706 in Recall) of the task-specific model PL-Marker [68]. Such a finding suggests that common-sense knowledge is still far from being sufficiently captured by LLMs. By aggregating results with irrelevant or deceptive information, this can damage the usefulness of the recommendation system.

5 LLMS AS CONTENT INTERPRETER

Content-based recommenders provide an effective solution for mitigating the sparse feedback issue in recommender systems. By leveraging the attributes and characteristics of items, these systems achieve a more profound understanding of item properties, facilitating accurate matching with user preferences. However, the content features used in content-based recommendation may also exhibit sparsity. Relying solely on the recommendation supervision signal, such as clicking and browsing, might not fully exploit the potential benefits of these features. To overcome this challenge, language models emerge as powerful fundamental algorithms that act as content interpreters for processing textual features. Their utilization enhances the effectiveness of recommender systems by effectively understanding and interpreting textual content, leading to improved recommendations.

5.1 Conventional Content Interpreter

Conventional content interpreters include statistical models, neural networks, and advanced NLP networks, as summarized in Figure 2. These approaches primarily focus on transforming content information, such as textual data, into feature embeddings to facilitate the recommendation process.

Statistical models like TF-IDF, Minimum Description Length (MDL) [69], and bag-of-words have traditionally been used to encode textual data such as news articles and documents into continuous-valued vectors. However, with the advancement of deep learning techniques, researchers have explored various neural network architectures to learn more expressive content representations. Instead of relying solely on statistical embeddings, some approaches initialize the vectors with bag-of-words representations and then employ autoencoder-based models to learn more powerful representations. For example, CDL [16] combines the latent vectors obtained from autoencoders with the original ID embeddings to enhance content representations. CRAE [70] introduces a collaborative recurrent autoencoder that captures the word order in texts, enabling the modeling of content sequences in collaborative filtering scenarios. Dong et al. [71] propose a stacked denoising autoencoder that reconstructs item/user ratings and textual information simultaneously, allowing for the joint modeling of collaborative and textual knowledge. CVAE [72] introduces a collaborative variational autoencoder that learns probabilistic textual features. While autoencoders are effective in learning low-dimensional representations from text data, they may struggle to capture semantic information effectively [73]. In some cases, approaches like doc2vec [74] are used to construct content embeddings [75], [76] and learn hidden representations. Okura et al. [77] evaluate different network architectures, including word models and GRU networks, for representing user states.

Following the advancements in neural natural language processing (NLP) models, more sophisticated architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and neural attention models have been employed as content interpreters to extract contextual information and capture user preferences. These models take sentence inputs, such as news titles, reviews, or comments, and transform them into word embedding matrices using random initialization or word2vec embeddings [78]. Various architectures, including CNNs, attention networks, and hybrid models, are utilized to learn representations of sentences. For example, NPA [79] and LSTUR [80] incorporate attention mechanisms to determine the importance of words after CNN layers. NRMS [81] and CPRS [82] utilize multi-head self-attention networks to learn word representations. These models are effective in capturing long-context dependencies and understanding the semantic information in the text. In addition to text modeling, language models are also used as content interpreters to capture user interests based on their historical interactions. For instance, WE3CN [83] employs a 3D CNN to extract temporal features from historical data. DKN [26] utilizes an attention mechanism to aggregate historical information related to candidate items. DAN [84] proposes an attention-based LSTM to capture richer hidden sequence features. These models leverage different neural architectures to enhance the representation of text in the context of recommendation systems. It is worth noting that these models still have limitations in terms of depth and the ability to effectively generalize semantic information.

5.2 Language Model based Content Interpreter

In recent years, there has been growing interest in incorporating more powerful pre-trained language models, such as BERT and GPT, into recommendation systems. These language models have shown exceptional performance on various natural language processing tasks and have inspired researchers to leverage them for capturing deep semantic representations and incorporating world knowledge into recommendation systems. However, applying pre-trained language models to recommendation tasks presents two main challenges. Firstly, there is a misalignment of goals between general-purpose language models and the specific objectives of recommendation systems. To address this, researchers have proposed approaches that fine-tune the pre-trained models or design task-specific pre-training tasks to adapt them to recommendation tasks. For example, U-BERT [85] employs BERT as a content interpreter and introduces masked opinion token prediction and opinion rating prediction as pre-training tasks to better align BERT with recommendation objectives. Similarly, other works [86], [87], [88], [89], [90] have utilized pre-trained BERT to initialize the news encoder for news recommendation, enhancing the representation of textual features. The pre-trained model ERNIE is also utilized to enhance the representation ability of queries and documents [91], [92].
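The "pre-trained encoder as content interpreter" recipe often amounts to embedding item text with a frozen encoder and matching it against a user profile built from the same embeddings. The sketch below shows one minimal variant using the Hugging Face bert-base-uncased checkpoint with mean pooling; the toy catalogue, the pooling choice, and the mean-of-history user profile are illustrative assumptions, not the exact setup of U-BERT or the other cited systems.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

@torch.no_grad()
def embed(texts):
    # Tokenize a batch of item texts and mean-pool the last hidden states.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state             # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)            # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

# Toy catalogue and user history (illustrative data).
items = ["A sci-fi thriller about AI", "A romantic comedy in Paris", "A space opera saga"]
history = ["An AI uprising drama", "A documentary about robotics"]

item_vecs = embed(items)
user_vec = embed(history).mean(0, keepdim=True)              # user profile = mean of history

scores = (user_vec @ item_vecs.T).squeeze(0)                 # cosine-similarity ranking
print(sorted(zip(scores.tolist(), items), reverse=True))
```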
[Figure 2 traces the development of content interpreters, from statistical models (e.g., MDL, bag-of-words), through neural networks (e.g., MLP, CNN, RNN, autoencoder) and advanced NLP networks (e.g., attention), to pre-trained language models (e.g., BERT, GPT, ERNIE) and pre-trained large language models (e.g., GPT-3.5) that bring language knowledge, reasoning and generalization.]
Fig. 2. The development of content interpreter in recommendation.

The second challenge is reducing the online inference latency caused by pre-trained language models, which can be computationally expensive. Researchers have explored techniques such as knowledge distillation and model optimization to obtain lightweight and efficient models suitable for online services. For instance, CTR-BERT [93] employs knowledge distillation to obtain a cache-friendly model for click-through rate prediction, addressing the latency issue.

Moreover, pre-trained language models have been applied beyond mainstream recommendation tasks. They have been integrated into various recommendation scenarios, including tag recommendation [94], tweet representation [95], and code example recommendation [96], to enhance the representation of textual features in those specific domains. Additionally, some recent works [97], [98], [99], [100] have explored using only textual features as inputs to recommendation models, leveraging pre-trained language models to alleviate cold-start problems and enable cross-domain recommendations based on the universality of natural language. ZESREC [97] uses BERT to obtain universal continuous representations of item descriptions for zero-shot recommendation. UniSRec [98] focuses on cross-domain sequential recommendation and employs a lightweight MoE-enhanced module to incorporate the fixed BERT representation into the recommendation task. VQ-Rec [99] further aligns the textual embeddings produced by pre-trained language models to the recommendation task with the help of vector quantization. Fu et al. [101] explore layerwise adapter tuning to achieve parameter-efficient transferable recommendations.

While pre-trained language models empower text understanding with the benefit of capturing world knowledge, the development of pre-trained large language models provides strong emergent abilities in reasoning and generalization. TALLRec [102] explores the ability of large language models for sequential recommendation. The authors observe that the original language models perform poorly in zero-shot and few-shot scenarios, while recommendation-specific instruction-tuned language models demonstrate superior performance in few-shot learning and cross-domain generalization. Similarly, Kang et al. [103] propose a similar instruction tuning method for rating prediction tasks based on the T5 backbone. They find that the tuned language models, which leverage data efficiently, outperform traditional recommenders. PALR [104] further enhances the construction pipeline of recommendation-specific instruction tuning: it first employs large language models to generate reasoning as additional features based on the user's behavior history; next, a small set of candidates is retrieved using any existing model based on the user profile; finally, to adapt general-purpose language models to the recommendation task, the generated reasoning features, user interaction history, and retrieved candidates are converted into natural language instruction data used to fine-tune a language model. Existing instruction tuning methods of language models for recommendation scenarios typically focus on a single type of recommendation task, limiting the full utilization of language models' strong generalization ability. InstructRec [105] addresses this limitation by formulating recommendation as an instruction-following procedure. It designs various instruction templates to accommodate different recommendation tasks and employs GPT-3.5 to generate high-quality instruction data based on users' historical data and the templates. The language models fine-tuned with this instruction data can effectively handle a wide range of recommendation tasks and cater to users' diverse information requirements.

6 LLMS AS EXPLAINER

In addition to valuing the suggestions made by a recommendation model, users are also interested in comprehensible justifications for these recommendations [106], [107]. This is crucial because most recommender systems are black boxes whose inner workings are inscrutable to human understanding [108], diminishing user trust. Taking drug recommendation as an example, it is unacceptable to simply recommend drugs with good curative effects but fail to give reasons why they are effective. To this end, explainable recommendations aim to couple high-quality suggestions with accessible explanations. This not only helps to improve the model's transparency, persuasiveness, and reliability, but also facilitates the identification and rectification of potential errors through insightful explanations.

TABLE 1
LLMs for Content Interpreter

Approach          | Task                                           | LLM backbone              | Tuning Strategy                  | Datasets
TALLRec [102]     | Sequential recommendation                      | LLaMA-7B                  | Instruction tuning & fine-tuning | MovieLens-100K, BookCrossing
LLMs-Rec [103]    | Rating prediction                              | Flan-T5-Base, Flan-T5-XXL | Fine-tuning                      | MovieLens-1M, Amazon Book
PALR [104]        | Item recommendation                            | LLaMA-7B                  | Instruction tuning               | MovieLens-1M, Amazon Beauty
InstructRec [105] | Sequential recommendation, personalized search | Flan-T5-XL                | Instruction tuning               | Amazon-Games, CDs
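To make the instruction-tuning strategies in Table 1 concrete, a single training sample typically pairs a natural-language instruction built from the user's history with the target item as the expected completion. The record below is a hypothetical illustration of such a format, not the exact template used by TALLRec, PALR, or InstructRec.

```python
# One hypothetical instruction-tuning record for a recommendation LLM.
sample = {
    "instruction": (
        "Given the movies this user has watched and liked, "
        "pick the item from the candidate list they are most likely to enjoy next."
    ),
    "input": (
        "Liked history: The Matrix; Blade Runner; Inception.\n"
        "Candidates: [A] Titanic  [B] Interstellar  [C] Frozen"
    ),
    "output": "[B] Interstellar",
}

# During fine-tuning, instruction + input form the prompt and output is the supervised target.
prompt = sample["instruction"] + "\n" + sample["input"]
target = sample["output"]
```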

These benefits have been extensively documented in recent work [109], [110], [111], [112]. For instance, [110] conducted a study that involved addressing 40 difficult tasks and evaluating the impact of explanations in zero-shot and few-shot scenarios. Their findings demonstrated that explanations have a positive effect on model performance by establishing a connection between examples and interpretation.

Traditional approaches mainly focus on template-based explanations, which can be broadly categorized into item-based, user-based, and attribute-based explanations [113]. Item-based explainable methods relate recommendations to familiar items [114], explaining that the recommended item bears similarity to others the user prefers; such explanations are prevalent on platforms like Amazon [115] and Netflix [116]. However, due to their collaborative nature, they may underperform in personalized recommendations requiring diversity and can struggle to identify relevant items efficiently in industrial settings with vast item catalogues. In contrast, user-based explanations [117] leverage social relationships to make recommendations, explaining that users with similar interests also favor the recommended item. The social nature of these explanations makes them more persuasive, encouraging users to try the recommendations. However, the variance in user preferences may render this approach less impactful in gauging actual preference. Lastly, attribute-based explanations focus on highlighting the attributes of recommended items that users might find appealing, essentially conveying "these features might interest you". This method demands customization according to each user's interests, yielding higher accuracy and satisfaction. Thus, they are at the forefront of research [106], [118], [119], [120], [121].

Obviously, such explanations typically employ pre-defined and formulaic formats, such as explanations based on similar items or friends. Although capable of conveying essential information, such inflexible formats may diminish user experience and satisfaction by lacking adaptability and personalization [106]. For this reason, natural language generation approaches have received increasing attention. Early work [122], [123], [124] mainly relied on recurrent neural networks (e.g., LSTM [125], GRU [126]). Limited by the models' expressiveness, they often suffer from insufficient diversity. With the excellent performance of Transformer-based models on various natural language tasks, some works attempt to integrate Transformer-based models into explainable recommendation. [127] use the position vectors corresponding to the user (item) IDs to predict the explanation tokens. Subsequent work [128] has shown that the generated explanation may fail to justify the user's preference by synthesizing irrelevant descriptions. Therefore, Ni et al. [129] used such auxiliary information as guided input to BERT to obtain a controllable justification. Considering that such auxiliary information is not always available in real-world scenarios, ExBERT [128] only requires historical explanations written by users, and utilizes a multi-head self-attention based encoder to capture the relevance between these explanations and user-item pairs. Recently, MMCT [130], EG4Rec [131], and KAER [132] have further carried out finer-grained modeling of information such as visual images, time series, and emotional tendencies to obtain higher-quality interpretations.

Due to the limited expressive power of traditional language models, natural language generation methods are prone to long-range dependence problems [128]; that is, for long text inputs, the generated explanations tend to lack diversity and coherence in content. In addition, these explanation methods are tightly coupled with specific recommendation models (e.g., NETE [124]) or directly design a new recommendation model (e.g., NRT [122], PETER [127]), and they are often powerless when faced with existing advanced recommendation models, which limits their generalizability. This is also a flaw of template-based methods. Notably, in industrial settings, recommendation algorithms frequently involve not just a single model but a cascade or integration of multiple models, and these elaborate combinations further exacerbate the difficulty of deciphering recommendations.

LLMs' remarkable generative ability in language tasks makes them ideal for tackling the aforementioned challenges [133]. Firstly, by leveraging extensive training data, LLMs adeptly harness human language, encompassing context, metaphors, and complex syntax. This equips them to craft customized explanations that are precise, natural, and adaptable to various user preferences [124], [127], [134], mitigating the limitations of conventional, formulaic explanations. Secondly, the unique in-context learning capabilities of LLMs, such as zero-shot prompting, few-shot prompting, and chain-of-thought prompting, enable them to garner real-time user feedback during interactions and to furnish recommendation outcomes together with their corresponding interpretations, fostering bidirectional human-machine alignment. A recent study [135] has demonstrated the potential of LLMs in elucidating the intricacies of complex models, as evidenced by GPT-4 autonomously interpreting the function of each GPT-2 neuron when given appropriate prompts and the corresponding neuron activations. This showcases an innovative approach to interpreting deep learning-based recommendation models.
It is critical to highlight that this interpretation technique is agnostic to the model's architecture, distinguishing it from traditional interpretations that are bound to specific algorithms. Thus, recommendation interpretations founded on LLMs pave the way for a versatile and scalable interpretational framework with broader applicability.

Although LLMs have inherently significant advantages for recommendation explanations, it is imperative to recognize potential issues. Firstly, akin to recommendation models, LLMs are essentially black boxes that are difficult for humans to understand. We cannot identify which concepts their explanations are based on [136]. Also, the explanations given may be insincere; that is, the explanations may be inconsistent with the recommended behaviors. Some recent developments [137], [111] involve utilizing chains of thought to prompt reasoning for improved interpretability; however, the opacity of the reasoning process at each step remains a concern, and [138] has questioned the possibly unfaithful explanations of chain-of-thought prompting. Secondly, the extensive data utilized by LLMs may encompass human biases and erroneous content [139]. Consequently, even if the explanation aligns with the model's recommendation behavior, both the explanation and the recommendation could be flawed. Monitoring and calibrating these models to ensure fairness and accuracy in explainable recommendations is essential. Lastly, generative models exhibit varying levels of proficiency across different tasks, leading to inconsistencies in performance. Identical semantic cues could yield disparate recommendation explanations. This inconsistency has been substantiated by recent studies [140], [141] focusing on LLMs' robustness. Addressing these issues calls for exploring techniques to mitigate or even circumvent low-reliability explanatory behavior, and investigating how LLMs can be trained to consistently generate reliable recommendation explanations, especially under adversarial conditions, is a worthwhile avenue for further research.

7 LLMS AS COMMON SYSTEM REASONER

With the development of large language models, there is an observation that LLMs exhibit reasoning abilities [2], [142] when they are sufficiently large, which is fundamental to human intelligence for decision-making and problem-solving. By providing the models with a 'chain of thoughts' [111], such as prompting with 'let us think about it step by step', large language models exhibit emergent abilities for reasoning and can arrive at conclusions or judgments according to evidence or logic. Accordingly, for recommender systems, large language models are capable of reasoning that helps mine user interests, thus improving performance.

7.1 Making Direct Recommendations

In-context learning [148], [149], [150], [151], [152], [153], [154] is one of the emergent abilities that differentiate LLMs from previous pre-trained language models: given a natural language instruction and task demonstrations, LLMs generate the output by completing the word sequence without training or tuning [3]. For in-context learning, the prompt consists of the task instruction and/or several input-output pairs that demonstrate the task, and a test input is appended so that the LLM makes predictions. Each input-output pair is called a shot. This emergent ability enables prediction on new cases without tuning, unlike previous machine learning.

In the realm of recommender systems, numerous studies have explored the performance of zero-shot/few-shot learning with large language models, covering common recommendation tasks such as rating prediction and ranking prediction. These studies evaluate the ability of language models to provide recommendations without explicit tuning, as summarized in Table 2, where all methods adopt in-context learning for direct recommendation. The general process is illustrated in Figure 3. Accordingly, we have the following findings:

• The aforementioned studies primarily focus on evaluating zero-shot/few-shot recommenders using open-domain datasets, predominantly in domains such as movies and books. Large language models are trained on extensive open-domain data, enabling them to possess a significant amount of common-sense knowledge, including information about well-known movies. However, when it comes to private domain data, such as e-commerce products or specific locations, the ability of zero-shot recommenders lacks validation, which is expected to be challenging.
• Current testing methods necessitate the integration of additional modules to validate the performance of zero-shot recommenders for specific tasks. In particular, for ranking tasks that involve providing a list of items in order of preference, a candidate generation module is employed to narrow down the pool of items [145], [146]. Generative models like gpt-3.5-turbo produce results in a generative manner rather than by recalling from existing memories, thus requiring additional modules to implement ID-based item recommendations.
• From the perspective of recommendation performance, zero-shot recommenders exhibit some capability, and few-shot learners perform better than zero-shot recommenders. However, there still exists a substantial gap compared to traditional recommendation models, and particularly to fine-tuned large language models designed specifically for recommendation, such as P5 [155] and M6-Rec [156]. This highlights that large language models do not possess a significant advantage in personalized modeling.

Another important emergent ability is 'step by step' reasoning, where LLMs can solve complex tasks by utilizing prompts that include previous intermediate reasoning steps, known as the 'chain of thoughts' strategy [111]. Wang and Lim [145] design a three-step prompt, namely NIR, to capture user preferences, extract the most representative movies and rerank the items after item filtering. Such a multi-step reasoning strategy significantly improves recommendation performance.
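As a concrete illustration of this in-context-learning setup, and of the kind of prompt sketched in Figure 3, the snippet below builds a few-shot ranking prompt from a user's history and a candidate list. The wording, the single demonstration, and the call_llm placeholder are illustrative assumptions rather than the exact prompts used by the studies in Table 2.

```python
def build_fewshot_rec_prompt(history, candidates, demo=None):
    """Assemble a few-shot prompt asking an LLM to rerank candidate items."""
    lines = ["You are a recommender. Rank the candidate items for the user, best first."]
    if demo is not None:  # optional demonstration, i.e. one "shot"
        lines += [
            f"User history: {', '.join(demo['history'])}",
            f"Candidates: {', '.join(demo['candidates'])}",
            f"Ranking: {', '.join(demo['ranking'])}",
            "",
        ]
    lines += [
        f"User history: {', '.join(history)}",
        f"Candidates: {', '.join(candidates)}",
        "Ranking:",
    ]
    return "\n".join(lines)

demo = {
    "history": ["The Godfather", "Goodfellas"],
    "candidates": ["Casino", "Frozen", "Heat"],
    "ranking": ["Casino", "Heat", "Frozen"],
}
prompt = build_fewshot_rec_prompt(
    history=["Interstellar", "The Martian"],
    candidates=["Gravity", "Notting Hill", "Arrival"],
    demo=demo,
)
print(prompt)
# ranking = call_llm(prompt)   # hypothetical chat-completion call; parse the returned list
```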

TABLE 2
Zero/few-shot learners of LLMs for RS

Approach | LLM backbone | Task | Metric | Datasets | ICL | COT
[143] | gpt-3.5-turbo | Rating prediction; sequential recommendation; direct recommendation; explanation generation; review summarization | RMSE, MAE; HR, NDCG; HR, NDCG; BLEU-4, ROUGE, human eval; BLEU-4, ROUGE, human eval | Amazon Beauty | ✓ |
[144] | text-davinci-002, text-davinci-003, gpt-3.5-turbo | Point-wise, pair-wise and list-wise ranking | NDCG, MRR | MovieLens-1M, Amazon-Book, Amazon-Music, MIND-small | ✓ |
[103] | Flan-U-PALM, gpt-3.5-turbo, text-davinci-003 | Rating prediction; ranking prediction | RMSE, MAE; ROC-AUC | MovieLens-1M, Amazon-Books | ✓ |
[145] | text-davinci-003 | Reranking | NDCG, HR | MovieLens 100K | ✓ | ✓
[146] | gpt-3.5-turbo | Reranking | NDCG | MovieLens-1M, Amazon-Games | ✓ |
[147] | gpt-3.5-turbo | Reranking | Precision | MIND | ✓ |

Fig. 3. An Example of zero/few-shot learning for direct recommenders
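To make the zero/few-shot prompting of Section 7.1 concrete, the following is a minimal sketch of how such a prompt for direct recommendation (in the spirit of Fig. 3 and the approaches in Table 2) could be assembled in Python. The item names, the commented call_llm client, and the output parsing are illustrative assumptions, not the implementation of any surveyed system.

def build_recommendation_prompt(history, candidates, demonstrations=None):
    """Compose an in-context-learning prompt: instruction, optional shots, test input."""
    lines = ["You are a movie recommender. Rank the candidate movies for the user."]
    for demo_history, demo_ranking in (demonstrations or []):      # few-shot "shots"
        lines.append(f"User watched: {', '.join(demo_history)}")
        lines.append(f"Recommendation: {', '.join(demo_ranking)}")
    lines.append(f"User watched: {', '.join(history)}")            # test input
    lines.append(f"Candidates: {', '.join(candidates)}")
    lines.append("Recommendation:")
    return "\n".join(lines)

prompt = build_recommendation_prompt(
    history=["Titanic", "Forrest Gump"],
    candidates=["The Shawshank Redemption", "Transformers", "La La Land"],
    demonstrations=[(["Alien", "Blade Runner"], ["Interstellar"])],  # one shot
)
# ranking = call_llm(prompt)  # any chat-completion client; parse the text into an ordered item list

Leaving demonstrations empty yields the zero-shot variant; adding more shots yields the few-shot setting compared in Table 2.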

The search space in recommender systems can be categorized into four types: (1) embedding size, (2) feature, (3) feature interaction, and (4) model architecture. Embedding size search, such as [157], [158], [159], [160], seeks an appropriate embedding size for each feature to avoid over-consuming resources. Feature search consists of raw feature search [161], [162] and synthetic feature search [163], [164], which select a subset of the original or cross features so that only informative features are retained, reducing both computation and space costs. Feature interaction search, such as [165], [166], [167], [168], [169], automatically filters out feature interactions that are not helpful. Model architecture search, like [170], [171], [172], [173], expands the search space to the whole architecture. The search strategy has shifted from discrete reinforcement learning, which iteratively samples architectures for training and is time-consuming, to differentiable search, which adaptively selects architectures within one-shot learning to circumvent the computational burden and converge more efficiently. The evaluation of each sampled architecture then acts as the signal to adjust the selections. That is, there is a decision maker that memorizes the results of previous architecture choices and analyzes these prior results to give the next recommended choice.

The emergent LLMs actually have excellent memorization and reasoning capabilities that could serve automated learning. Several works have attempted to validate the potential of automated machine learning with LLMs. Preliminarily, GPT-NAS [174] takes advantage of the generative capability of LLMs: network architectures are formulated as character sequences, so the generation of network architectures can be easily achieved through generative pre-training models. NAS-Bench-101 [175] is utilized for pre-training, and the state-of-the-art results are used for fine-tuning. The generative pre-training models produce reasonable architectures, which reduces the search space for the subsequent genetic algorithms that search for optimal architectures. The relatively advanced reasoning ability is further evaluated in GENIUS [176], where GPT-4 is employed as a black-box agent to generate potentially better-performing architectures according to previous trials, including the tried architectures and their evaluation performance. According to the results, GPT-4 can generate good network architectures, showing potential for more complicated tasks.
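As an illustration of this black-box usage (prior trials fed back to the model, in the spirit of GENIUS), the sketch below shows one plausible search loop. The search space, the evaluate routine, and the llm callable are hypothetical placeholders under stated assumptions, not the actual GENIUS implementation.

import json

SEARCH_SPACE = {"embedding_size": [8, 16, 32, 64], "num_cross_layers": [1, 2, 3]}

def run_llm_search(llm, evaluate, rounds=10):
    """llm: text-in/text-out callable; evaluate: trains a config and returns its AUC."""
    trials = []
    for _ in range(rounds):
        prompt = (
            "You are tuning a recommendation model.\n"
            f"Search space: {json.dumps(SEARCH_SPACE)}\n"
            f"Previous trials: {json.dumps(trials)}\n"
            "Propose one new configuration as a JSON object."
        )
        config = json.loads(llm(prompt))            # e.g. {"embedding_size": 32, "num_cross_layers": 2}
        trials.append({"config": config, "auc": evaluate(config)})
    return max(trials, key=lambda t: t["auc"])      # best configuration found so far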

Yet it is too difficult for LLMs to directly make decisions on challenging technical problems only by prompting. To balance efficiency and interpretability, one approach is to integrate LLMs into certain search strategies, where a genetic algorithm guides the search process and LLMs generate the candidate crossovers. LLMatic [177] and EvoPrompting [178] use code-LLMs as mutation and crossover operators for a genetic NAS algorithm. During evolution, each generation has a certain probability of deciding whether to perform crossover or mutation to produce new offspring, and the crossovers and mutations are generated by prompting LLMs. Such a solution integrates the LLM into the genetic search algorithm, which achieves better performance than direct reasoning.

The research mentioned above brings valuable insights to the field of automated learning in recommender systems. However, there are several challenges that need to be addressed. Firstly, the search space in recommender systems is considerably more complex, encompassing diverse types of search space and facing significant volume issues. This complexity makes it challenging to effectively explore and optimize the search space. Secondly, compared to the common architecture search in other domains, recommender systems lack a strong foundation of knowledge regarding the informative components within the search space, especially the effective high-order feature interactions. Unlike the well-established network structures in other areas, recommender systems operate in various domains and scenarios, resulting in diverse and domain-specific components. Addressing these challenges and advancing the understanding of the search space and its informative components in recommender systems will pave the way for significant improvements in automated learning approaches.

8 LLMS AS CONVERSATIONAL AGENT

A conversational recommender system (CRS) is a specialized type of recommendation tool that aims to uncover users' interests and preferences through dialogue, enabling personalized recommendations and real-time adjustment of recommendation strategies based on user feedback. Compared to traditional recommender systems, conversational recommender systems have the advantage of understanding user intents in real time and adapting recommendations based on user feedback. Typically, a conversational recommender system consists of two main components: a dialogue module and a recommendation module. In this section, we primarily focus on the dialogue module, which plays a crucial role in facilitating effective user-system interactions and understanding user preferences.

In a conversational recommender system, the dialogue module typically takes the form of a dialogue system. Dialogue systems can generally be classified into two main categories: chit-chat and task-oriented. The former focuses on open-domain question answering, and two major methods are commonly employed: generative and retrieval-based. Generative methods [179], [180], [181] utilize a sequence-to-sequence model structure to generate responses, while retrieval-based methods [182], [183], [184] transform response generation into a retrieval problem by searching for the most relevant response in a response database based on the dialogue context. In conversational recommender systems, task-oriented dialogue systems are more often required, as they are specifically designed to assist users in accomplishing specific tasks. For task-oriented dialogue systems, a common approach [185], [186] is to treat response generation as a pipeline and handle it separately using four components: dialogue understanding [187], [188], dialogue state tracking [189], [190], [191], dialogue policy learning [192], [185], and natural language generation [193], [194]. Another approach is to employ an end-to-end method [195], [196], [197], training an encoder-decoder model to handle all the processing steps collectively. The first approach suffers from scalability issues and lacks synergy between the components, while the second approach requires a substantial amount of supervised data for training.

Based on this classification of dialogue systems, common approaches in conversational recommender systems can also be divided into two categories: attribute-based QA (question-answering) and generative methods. The attribute-based QA approach [185], [198], [199], [200] utilizes a pipeline method within the dialogue system. In each dialogue turn, the system needs to decide whether to ask the user a question or provide a recommendation, and the decision-making process, particularly regarding which attribute to ask about, is typically handled by a policy network. On the other hand, generative methods do not explicitly model the decision-making process. Instead, they often employ an end-to-end training approach, where a sequence-to-sequence model generates output directly from a shared vocabulary of words and items; whether the generated output is chit-chat, a question, or a recommendation is implicitly determined during the generation process. Compared to attribute-based QA methods, generative methods [201], [202], [203], [204] appear to be simpler and more scalable; however, they require a large amount of supervised training data. With the advancement of pre-trained language models (PLMs) in natural language processing, particularly models like BERT [205] and GPT [206], the capabilities of pre-trained models in language understanding and generation have become increasingly powerful. Researchers have found that fine-tuning pre-trained models with a small amount of supervised data can yield impressive results on specific tasks. This discovery has led to the application of PLMs in generative conversational recommender systems. For example, DialoGPT [197] achieved promising dialogue intelligence by fine-tuning GPT-2 on dialogue data collected from platforms like Reddit. Subsequently, BARCOR [202], RecInDial [204], and UniCRS [203] utilized DialoGPT for constructing conversational recommender systems, with variations in their action decision strategies. While PLMs reduce the dependency of generative dialogue models on extensive data, the fine-tuning process still incurs significant computational time and requires the collection of high-quality domain-specific training data due to the large parameter space of the models.

With the increase in model parameters and training data, the intelligence and knowledge capacity of models continues to improve. OpenAI has been expanding the model parameters and training data while employing techniques such as RLHF (Reinforcement Learning from Human Feedback) and instruction tuning to further fine-tune GPT-3 [3]. This has led to the emergent abilities of models like InstructGPT [207] and subsequent models like ChatGPT, which exhibit incredible intelligence and have opened the doors to new intelligent dialogue systems based on large language models (LLMs).

Furthermore, Google's BARD and META's LLaMA [208] are also large language dialogue models that have been proposed and have demonstrated remarkable conversational abilities. The Vicuna model, for instance, utilizes dialogue corpora shared by users of ChatGPT to fine-tune the open-source LLaMA model, with the team claiming it can achieve over 90% of ChatGPT's capability. This series of successive LLM introductions has brought new insights to conversational recommender systems. Due to the utilization of extensive open-domain corpora during LLM training, such models possess inherent conversational recommendation capabilities and can provide reasonable recommendations in open domains such as movies, music, and games.

However, there are still significant challenges in building an enterprise-level CRS. The first challenge is the lack of awareness of large models about private domain data. It is well known that most of the training data for LLMs, such as GPT-3, comes from publicly available sources on the internet. As a result, these models may lack visibility into the data that resides within information platforms, making their modeling and understanding capabilities for such data relatively poor. To address this challenge, two approaches are currently being explored: fine-tuning [197] and tool learning [209], [210]. Fine-tuning involves tuning the LLM using private domain-specific dialogue data. There are two major concerns with this approach. First, massive high-quality domain-specific dialogue data is required to tune the extremely large model. However, in most recommendation scenarios, data primarily consists of explicit or implicit user-item interactions, which may lack conversational context; therefore, generating high-quality dialogue data from interaction data is a key concern. In RecLLM [210] and iEvaLM [211], researchers have proposed using LLMs to construct a user simulator for generating conversational data. Besides, the fine-tuning technique plays a crucial role in determining the ultimate quality of LLMs: a well-designed and effective fine-tuning strategy, such as the instruction tuning and RLHF proposed in InstructGPT [3], can lead to significant improvements in the model's performance and capabilities. Tool learning is another approach to address this challenge, and its main idea is to treat traditional recommendation models, such as Matrix Factorization (MF) and DeepFM, as tools to be utilized. For a more detailed explanation of tool learning, please refer to Section 9. Since recommendation models are domain-specific, the LLM can leverage these models to obtain recommendation results and recommend them to the users in its response. In this approach, there are two main technical points: the construction of the tool model and the engineering of prompts to guide the LLM in the proper utilization of the tool. First of all, conventional recommendation models generally use ID or categorical features as input, while users always express their requirements or preferences in natural language during conversations. Therefore, unstructured text features should be taken into consideration in tool construction. In Chat-REC [209], a conventional recommendation model and a text embedding-based model (text-embedding-ada-002) are used as tools. RecLLM [210] adopted a language-model-enhanced dual-encoder model and several text retrieval methods as the recommendation engine. On the other hand, despite the strong intelligence and reasoning capabilities of LLMs, effectively harnessing these abilities requires well-crafted prompts for guidance. For instance, the Chain of Thought proposed by Wei et al. [111] can trigger an LLM to reason and engage in step-by-step thinking, which benefits the tool-using capability. Subsequent studies like ToT [212], Plan-and-Solve [213], and ReAct [214] have proposed more advanced prompt-design techniques to guide LLMs toward deeper thinking and tool planning.

The second challenge lies in memory and comprehension in long conversations. Due to the input constraints of LLMs, models like ChatGPT can support a maximum of 4096 tokens in a single call, including both input and output. In multi-turn dialogue scenarios, longer dialogue contexts often risk exceeding this token limit. The simplest approach to tackle this challenge is to trim the dialogue by discarding earlier turns. However, in conversational recommender systems, users may express a significant amount of personal information and interests in the early stages of the conversation, and the omission of such information directly impacts the accuracy of recommendations. To address this issue, several relevant works have proposed solutions. MemPrompt [215] enhances the prompt by incorporating a memory module, enabling GPT-3 to possess stronger long-dialogue memory. Similarly, RecLLM [210] leverages the LLM to extract user profiles and store them as factual statements in user memory; when processing user queries, relevant facts are retrieved based on text similarity.

9 TOOL-LEARNING AND ITS APPLICATIONS IN RECOMMENDATION

9.1 LLM-based Tool Learning

Tool learning is an emerging research field that aims to enhance task-solving capabilities by combining specialized tools with foundation models, which has been understood by [216] from two perspectives:

1) Tool-augmented learning treats specialized tools as assistants for improving the quality and accuracy of tasks, or Tool for AI;
2) Tool-oriented learning focuses more on training models to effectively use tools, controlling and optimizing tool-applying processes, or AI for Tool.

Tool learning has found applications in various fields, and this section primarily focuses on tool learning paradigms based on large language models (LLMs). While recent works often involve a combination of these two perspectives, we do not specifically categorize each work into one type. LLMs, such as GPT, are well-suited for tool learning applications [217]. With their powerful natural language processing capabilities, LLMs can break down complex tasks into smaller sub-tasks and convert them into executable instructions. Specialized tools allow LLMs to access knowledge that is beyond their own understanding. By integrating specialized tools, LLMs can better understand and address complex problems, offering more accurate and efficient solutions.

TABLE 3
LLM-based tool learning approaches

Approach | Tool usage | LLM backbone | Task
Re3 [218] | LLM | gpt3-instruct-175B, gpt3-instruct-13B | Long story generation
PEER [219] | LLM | LM-Adapted T5 | Editions, citations, quotes
METALM [220] | Pretrained encoders with diverse modalities | Transformer (pretrained from scratch) | Language-only tasks; vision-language tasks
Atlas [221] | Dense retriever | T5 | Knowledge-intensive language tasks; massively-multitask language understanding; question answering; fact checking
LaMDA [222] | Retriever, translator, calculator | Decoder-only Transformer | Dialog
WebGPT [223] | Web browser | gpt-3 | Question answering
Mind's Eye [224] | Physics engine, text-to-code LM | gpt-3, PaLM | Reasoning
PAL [225] | Python interpreter | CODEX (code-davinci-002) | Mathematical, symbolic, and algorithmic reasoning
SayCan [226] | Robots | PaLM | Real-world robotic tasks
HuggingGPT [227] | AI models in the Hugging Face community | gpt-3.5-turbo, text-davinci-003, gpt-4 | Image classification, image captioning, object detection, etc.
Auto-GPT | Web browser | gpt-3.5-turbo, text-davinci-003, gpt-4 | User-specified tasks
Visual ChatGPT [228] | Visual foundation models | text-davinci-003 | Visual customized tasks
TaskMatrix.AI [229] | Customized models with a unified API form | text-davinci-003 | Visual customized tasks
ReAct [214] | Wikipedia API | PaLM-540B | Question answering, fact verification
Toolformer [230] | Calculator, Q&A system, search engine, translation system, calendar | GPT-J | Downstream tasks

LLMs are commonly applied as controllers that select and manage various existing AI models to solve complex tasks, relying on user input and language interfaces for making summarizations. They act as the central component, responsible for comprehending problem statements and deciding which actions to execute; additionally, they aggregate the outcomes based on the results of the executed actions. In that vein, HuggingGPT [227] leverages existing models from the Hugging Face community1 to assist in task-solving. Visual ChatGPT [228] combines visual foundation models like BLIP [231] and Stable Diffusion [232] with LangChain2 to handle complex visual tasks, while the subsequent TaskMatrix.AI [229] extends the capabilities of Visual ChatGPT by maintaining a unified API platform, enabling input from multiple modalities and generating more complex task solutions. In contrast, Auto-GPT3 operates as an agent that autonomously understands specific targets through natural language and performs all processes in an automated loop, without requiring mandatory human input. WebGPT [223] introduces a text-based web browsing interactive environment, where LLMs learn to emulate the complete process of human interaction with a web browser using behavior cloning and rejection sampling techniques. In ReAct [214], by leveraging an intuitive prompt, LLMs learn to alternately generate reasoning paths and task-specific actions when solving a specific task; the execution of specific actions is delegated to corresponding tools, and the external feedback obtained from these tools is utilized to validate and further guide the reasoning process. The motivation behind Toolformer [230] aligns closely with ReAct; however, it goes a step further by combining diverse tools within a single model. This integration provides the model with flexible decision-making abilities and improved generalization, achieved through a simple yet effective self-supervised method. In contrast to prior works, LATM [233] takes a novel approach by empowering LLMs to directly generate tools. It achieves a division of labor within the task-solving process by employing LLMs at different scales: the tool maker, the tool user, and the dispatcher. LATM is entirely composed of LLMs, enabling the self-generation and self-utilization of tools.

1. https://huggingface.co
2. https://docs.langchain.com
3. https://github.com/Significant-Gravitas/Auto-GPT
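The controller/agent behavior surveyed above can be made concrete with a short sketch. The loop below follows the ReAct-style pattern of interleaving model-generated thoughts, tool calls, and observations; the two toy tools, the text protocol, and the llm callable are assumptions for illustration only, not the interface of any of the systems cited above.

def run_agent(llm, task, max_steps=5):
    """Interleave LLM reasoning with tool calls until a final answer appears."""
    tools = {
        "search_items": lambda q: ["item_42", "item_7"],              # stand-in retrieval tool
        "lookup_user": lambda uid: {"id": uid, "likes": ["sci-fi"]},  # stand-in profile store
    }
    transcript = (
        f"Task: {task}\n"
        "At each step output 'Thought: ...' followed by either "
        "'Action: tool_name(argument)' or 'Answer: final answer'.\n"
    )
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        if "Action:" in step:                                          # execute the requested tool
            call = step.split("Action:", 1)[1].strip()
            name, arg = call.rstrip(")").split("(", 1)
            transcript += f"Observation: {tools[name.strip()](arg.strip())}\n"
    return None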

9.2 Applications in Personalization Scenarios

Recently, LLMs have demonstrated impressive abilities in leveraging internal world knowledge and common-sense reasoning to accurately understand user intent from dialogues. Moreover, LLMs can communicate with users fluently in natural language, offering a seamless and delightful user experience. These advantages make LLMs an appealing choice as recommendation agents to enhance the personalized experience.

However, despite the impressive memory capacity of LLMs, they face challenges in memorizing specific knowledge in private and specialized domains without sufficient training. For instance, storing the item corpus and all user profiles of a recommender system can be challenging for LLMs. This limitation can result in LLMs generating inaccurate or incorrect responses and makes it difficult to control their behavior within a specific domain. Furthermore, LLMs face the temporal generalization problem, as external knowledge continues to evolve and change over time. To address these issues, various tools can be utilized to augment LLMs and enhance their effectiveness as recommendation agents.

Search engine. Search engines are widely employed to provide external knowledge to LLMs, reducing LLMs' memory burden and alleviating hallucinations in LLMs' responses. BlenderBot 3 [234] uses specific datasets to fine-tune a series of modules, enabling LLMs to learn to invoke the search engine at the appropriate time and extract useful knowledge from the retrieval results. LaMDA [222] learns, through fine-tuning, to use a toolset that includes an IR system, a translator, and a calculator in order to generate more factual responses. RETA-LLM [235] is a toolkit for retrieval-augmented LLMs; it disentangles IR systems and LLMs entirely, facilitating the development of in-domain LLM-based systems. The work in [222] shows a case of applying LaMDA to content recommendation: preconditioned on a few role-specific dialogues, LaMDA can play the role of a music recommendation agent.

Recommendation engine. Some works have attempted to alleviate the memory burden of LLMs by equipping them with a recommendation engine as a tool, enabling LLMs to offer recommendations grounded on the item corpus. The recommendation engine in Chat-REC [209] is divided into two stages, retrieval and reranking, which aligns with typical recommendation system strategies. In the retrieval stage, LLMs utilize traditional recommendation systems as tools to retrieve 20 items from the item corpus as a candidate item set. Subsequently, the LLM itself is employed as a tool to rerank the candidate item set. LLMs' common-sense reasoning ability, coupled with their internal world knowledge, allows them to provide explanations for the sorted results. The recommendation engine tool used in RecLLM [210] is highly similar to that in Chat-REC and is also divided into retrieval and reranking stages; RecLLM further provides several practical solutions for large-scale retrieval, such as the Generalized Dual Encoder Model and Concept-Based Search.

Database. Databases are also utilized as tools to supplement additional information for LLMs. In order to better cope with the cold-start problem for new items and alleviate the temporal generalization problem of LLMs, a vector database is utilized in Chat-REC [209] to provide information about new items that the LLMs are unaware of. When encountering new items, LLMs can utilize this database to access information about them based on the similarity between the user's request embedding and the item embeddings in the database. User profiles can also help LLMs better understand the user's intent. RecLLM [210] employs a user profile module as a tool to deposit meaningful and enduring facts about users, exposed during historical conversations, into user memory, and to retrieve a single fact related to the current dialogue when necessary.

Although some works have applied the concept of tool learning to personalization systems, there are still interesting and promising research topics that deserve exploration. 1) Fine-tuning models for better tool use. In-context learning has shown promise in teaching LLMs how to effectively use tools with a small number of demonstrations, as shown in Chat-REC and RecLLM. However, LLMs often struggle to learn strategies for handling complex contexts with limited demonstrations. Fine-tuning is a viable option for improving tool use, but it requires sufficient training data and effective techniques. RecLLM further fine-tunes some of its modules using synthetic data generated by a user simulator through the RLHF [207] technique. Investigating methods to obtain sufficient training data and developing tailored fine-tuning techniques for recommendation systems is a worthwhile research direction. 2) Developing a more powerful recommendation engine. Traditional recommendation systems often rely on collaborative filtering signals and item-to-item transition relationships for recommendations. However, with LLMs as the foundation models, user preferences can be expressed through natural language and even images. Therefore, developing a recommendation engine that supports multimodal data is a crucial research direction. Additionally, the recommendation engine should also be capable of adjusting the candidate set based on user preferences or feedback (such as querying movies of a specific genre or disliking an item in the recommendation set). 3) Building more tools. To provide LLMs with more authentic and personalized information, the development of additional tools is crucial. For example, APIs for querying knowledge graphs [236] or accessing users' social relationships can enhance the knowledge available to LLMs, enabling more accurate and tailored recommendations.
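As a concrete illustration of the retrieve-then-rerank and vector-database patterns above, the sketch below retrieves candidates by embedding similarity and then asks an LLM to reorder them. The embedding source, the llm callable, and the response parsing are assumed placeholders, not the Chat-REC or RecLLM implementations.

import numpy as np

def retrieve_candidates(query_vec, item_vecs, item_ids, k=20):
    """Nearest-neighbour lookup over item embeddings (a vector-database stand-in)."""
    sims = item_vecs @ query_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [item_ids[i] for i in np.argsort(-sims)[:k]]

def rerank_with_llm(llm, user_request, candidates):
    prompt = (
        f"User request: {user_request}\n"
        f"Candidates: {', '.join(candidates)}\n"
        "Reorder the candidates from most to least relevant, one per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]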

10 LLMS AS PERSONALIZED CONTENT CREATOR

Traditional recommender systems focus on suggesting existing items based on user preferences and historical data, where the displayed content has already been generated and is simply retrieved. However, with the advancement of techniques and platforms for content creators, personalized content creation has attracted more and more attention: more appealing content is custom-generated to match the user's interests and preferences, especially in the realm of online advertising [237]. Common contents include visual and semantic contents [238], [239], [240], such as titles, abstracts, descriptions, copywriting, ad banners, thumbnails, and videos. One widely discussed topic is text ad generation, where the ad title and ad description are generated with personalized information. Earlier works adopt pre-defined templates [241], [242], [238] to reduce the extensive human effort, which, however, often fail to fully meet the user's interests and preferences. More recent data-driven methods have emerged, which incorporate user feedback as rewards in a reinforcement learning framework to guide the generation process [243], [244], [245], [246]. Furthermore, the incorporation of pre-trained language models has played a significant role in improving the generation process for multiple content items [247], [248], [249], [250]. This integration helps refine content generation models and improves their ability to meet user preferences effectively.

As recommender systems and large language models continue to evolve, a promising technique that would bring new opportunities is the integration of AI-Generated Content (AIGC). AIGC [251] involves the creation of digital content, such as images, music, and natural language, through AI models, with the aim of making the content creation process more efficient and accessible. Earlier efforts in this field focused on deep-learning-based generative models, including Generative Adversarial Networks (GANs) [252], Variational AutoEncoders (VAEs) [253], Normalizing Flows [254], and diffusion-based models [255] for high-quality image generation. As generative models evolved, the transformer architecture [23] eventually emerged as the foundational block for BERT [256] and GPT [206] in the field of NLP, and for the Vision Transformer (ViT) [257] and Swin Transformer [258] in the field of CV. Moreover, the scope of generation tasks has expanded from uni-modal to multi-modal tasks, including the representative model CLIP [259], which can be used as an image encoder with multi-modal prompting for generation. Multi-modal generation has become an essential aspect of AIGC, which learns multimodal connections and interactions, typically including vision-language generation [259], text-audio generation [260], text-graph generation [261], and text-code generation [262]. With the emergence of large language models, AIGC is nowadays achieved by extracting the human intention from instructions and generating content according to the model's knowledge and that intention. Representative products, including ChatGPT [263], DALL-E-2 [264], Codex [265], and Midjourney [266], have attracted significant attention from society. With the growth of data and model size, models can learn more comprehensive information, leading to more realistic and higher-quality content creation.

Returning to personalized content creation, large language models bring opportunities from the following points. First, large language models further extend the capabilities of pre-trained models, allowing for better reasoning about users' personalized intents and interests; previous methods [248], [249] that depend on tailored pre-training models may thus be enhanced with better reasoning abilities and few-shot prompting. Secondly, the Reinforcement Learning from Human Feedback (RLHF) strategy can be applied to fine-tune models to better capture user intent information, similar to the existing RL-based framework [244] for text ad generation. Last but not least, the powerful generative abilities of large language models enable realistic creation thanks to the availability of sufficient cross-modal knowledge bases. The work [267] more specifically proposes a recommendation paradigm based on ChatGPT, where the generation process receives feedback over multiple rounds of conversation to better capture the user's explicit preferences. Compared to previous training paradigms, more explicit expressions of user interest can be understood by the large language models and converted into corresponding instructions to guide the generation of content, significantly alleviating the problem of extremely sparse feedback.

However, there are two major security and privacy risks for personalized content creators. One concern is the reliability of models like ChatGPT in terms of factuality, as indicated in [268]. While these models generate content that appears reasonable, there is a risk of distributing misleading or inaccurate information, which can weaken the truthfulness of internet content. This concern becomes particularly crucial in personalized recommendations, where the model may inadvertently promote misleading information tailored to the user's interests. The second concern revolves around data privacy, encompassing both user profiles and long-term human interaction histories. In the case of large language models, these interaction histories are collected or shared, potentially leading to the large models memorizing sensitive user data. Previous work [269] has demonstrated that large language models, especially GPT-2 [270], memorize and leak individual training examples. This emphasizes the need for strict user approval and careful handling of annotator data to mitigate privacy risks, and it is crucial to develop new techniques that prioritize privacy preservation during the training process.

11 OPEN CHALLENGES

11.1 Industrial Challenges

Personalization services, particularly recommender systems, are complex industrial products that face numerous challenges when implemented in real-world scenarios. We summarize the key challenges as follows:

Scaling computational resources. Existing large language models, such as BERT and GPT, demand significant computational power for training and inference, including high memory usage and time consumption. Fine-tuning these models to align them with personalization systems, which has shown promising results for improved personalization performance, can be computationally intensive. Several efficient fine-tuning strategies, e.g., option tuning in M6-Rec [156], LoRA [271], and QLoRA [272], have been developed to address this issue and pave the way for more efficient tuning.

Significant response time. Achieving efficient response times is crucial for online serving and greatly impacts the personalized user experience. Response time covers both the inference phase of large language models and the handling of large numbers of concurrent user requests. The introduction of large language models can result in considerable inference time, posing a challenge for real-world deployment. One approach is to pre-compute the embeddings of intermediate outputs from language models, storing and indexing them in a vector database, particularly for methods that utilize large language models as textual encoders. Other approaches, such as distillation and quantization, aim to strike a balance between performance and latency.
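One way to realize the pre-computation idea mentioned above is to build item text embeddings offline and serve only cheap similarity lookups online. The sketch below assumes an encode_text function provided by whatever LLM-based encoder is in use, and a brute-force dot product standing in for a real approximate-nearest-neighbour index.

import numpy as np

class EmbeddingCache:
    """Offline: embed and index item texts once. Online: cheap top-k lookups."""
    def __init__(self, encode_text):
        self.encode_text = encode_text      # assumed LLM-based text encoder
        self.ids, self.matrix = [], None

    def build(self, items):                 # items: [{"id": ..., "text": ...}, ...]
        self.ids = [item["id"] for item in items]
        self.matrix = np.stack([self.encode_text(item["text"]) for item in items])

    def topk(self, query_vec, k=10):
        scores = self.matrix @ query_vec    # replace with an ANN index at scale
        return [self.ids[i] for i in np.argsort(-scores)[:k]]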

11.2 Laborious Data Collection

Large language models are widely known to leverage extensive amounts of open-domain knowledge during their training and fine-tuning processes. These knowledge sources include well-known references such as Wikipedia, books, and various websites [3]. Similarly, when applied in recommender systems, these models often rely on representative open-domain datasets such as MovieLens and Amazon Books. While this type of open-domain knowledge contains a wealth of common-sense information, personalized tasks require access to more domain-specific data that is not easily shareable. Additionally, user feedback in personalized tasks can be complex, sparse, and noisy. Collecting and filtering this data, in contrast to acquiring common-sense knowledge, presents challenges: it incurs higher labor costs and introduces additional training redundancy due to the need for extensive data processing and filtering. Furthermore, designing appropriate prompts to instruct or fine-tune large language models is crucial for aligning them with the distribution of in-domain inputs in personalization tasks. By carefully tailoring the prompts, researchers and practitioners can guide the model to produce outputs that better cater to personalized applications, thereby maximizing performance and effectiveness.

11.3 Long Text Modeling

Large language models have a limitation on the maximum number of input tokens they can handle, typically constrained by the context window size, e.g., 4096 tokens for ChatGPT. This poses challenges when dealing with long user behavior sequences, which are common in modern recommender systems, and careful design is necessary to generate effective and appropriate prompt inputs within this limited length. In the case of conversations with multiple rounds, accumulating several rounds of dialogue can easily exceed the token limit. The current approach to handling long conversations is to truncate the history, keeping only the most recent tokens; however, this truncation discards valuable historical information, potentially harming model performance. To address these challenges, several techniques can be employed. One approach is to prioritize and select the most relevant parts of the user behavior sequence or conversation history to include in the prompt; this selection can be based on criteria such as recency, importance, or relevance to the task at hand. Another technique involves summarizing or compressing the lengthy input while preserving essential information, for example through extractive summarization or by representing the long sequence in a condensed form. Moreover, architectural modifications, such as hierarchical or memory-augmented models, can be explored to better handle long sequences by incorporating mechanisms to store and retrieve relevant information efficiently.

In addition, collaborative modeling of long text data and recommendation tasks is an emerging and pressing challenge. In conventional personalization systems, item ID information along with other categorical information is commonly used for modeling feature interactions and user preferences. With the rise of large language models, there is a growing trend toward leveraging textual information more extensively, since textual data provides unique insights about items and users and is therefore valuable for modeling. From the perspective of modeling, dealing with long text data requires more attention and complexity than categorical data, not to mention the need to match it with the modeling of user interests. From the perspective of implementation, reforming the entire pipeline becomes necessary to accommodate the requirements of efficient latency. Efficiently processing and incorporating long text data into recommendation models and serving them in real time present technical challenges.

11.4 Interpretability and Explainability

While large language models provide good reasoning capabilities, they are notorious for their 'black box' nature: their enormous size and layered architecture make them highly complex and non-linear, and it is challenging to comprehend their internal workings and understand the generation process of recommendations. Without a deep understanding of how the model operates, it becomes difficult to detect and address biases or ensure fair and ethical recommendations. When transparency about the internal mechanisms is lacking, users struggle to trust and accept the decisions made by the system, and users often desire understandable explanations for recommended choices. Addressing the challenge of model interpretability and explainability requires research involving natural language processing, explainable AI, human-computer interaction, and recommendation systems. The main focus is the development of techniques that unveil the inner workings of language models, facilitate the generation of meaningful and accurate interpretations, and enable robust evaluation methods. By providing transparent and interpretable recommendations, users can establish trust, understand the reasoning behind the recommendations, and make informed decisions.

11.5 Evaluation

Conventional personalization systems typically rely on task-specific metrics, such as the ranking-oriented metrics NDCG, AUC, and Recall, to evaluate model performance. However, with the integration of large language models into recommender systems, the evaluation tools and metrics undergo significant changes. Traditional metrics may not sufficiently capture the performance of recommender systems powered by large language models, which introduce novel capabilities and generate recommendations in a different manner, requiring the development of new evaluation tools.

One crucial aspect of evaluation is considering user preferences in large language model-powered systems, which requires a user-centric approach. Metrics such as user satisfaction, engagement, and overall experience become essential considerations. For example, Liu's work [143] proposes a crowdsourcing task to assess the quality of generated explanations and review summaries, providing a way to evaluate the effectiveness of the generated content. Additionally, user satisfaction surveys and feedback questionnaires can serve as valuable options.
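Alongside such user-centric evaluations, the conventional offline metrics mentioned at the start of this subsection remain easy to compute. The snippet below is a small, generic sketch of HR@k and NDCG@k with binary relevance, not tied to any particular benchmark or surveyed system.

import math

def hit_rate_at_k(ranked_items, relevant_items, k=10):
    return float(any(item in relevant_items for item in ranked_items[:k]))

def ndcg_at_k(ranked_items, relevant_items, k=10):
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:k]) if item in relevant_items)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant_items), k)))
    return dcg / idcg if idcg > 0 else 0.0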

Another perspective to consider is the health of the system, which involves assessing factors like diversity, novelty, serendipity, and user retention rates. These metrics help evaluate the freshness of recommendations and the long-term effects of large language models. Furthermore, it is crucial to assess the interpretability and fairness of recommendations. The interpretability assessment focuses on measuring the clarity, understandability, and transparency of recommendations, while the fairness evaluation aims to address potential biases in personalized results. By prioritizing fairness, we strive to create personalized experiences that are equitable and inclusive for all users. Both of these evaluations are essential to enhance the overall user experience and build confidence in the personalized recommendations delivered by the system.

11.6 Trade-off between Helpfulness, Honesty, and Harmlessness

When large language models are employed for personalization, some of their disadvantages can be magnified, and striving for a more honest and harmless system may come at the expense of system performance.

First of all, the accuracy and factuality of the system must be ensured. Although large language models can generate seemingly reasonable content, there is a risk of disseminating misleading or inaccurate information. This becomes even more critical when incorporating user feedback, as the model may mimic user behaviors in an attempt to appear honest; however, this imitation can result in biased guidance for users, offering no real benefit.

Secondly, in terms of harmlessness, concerns regarding privacy, discrimination, and ethics arise. While large language models have the potential to provide highly personalized recommendations by leveraging user data, privacy and data security become paramount. Unlike open-domain datasets, the privacy of individual data used for training should be rigorously protected, with strict user permissions for sharing personal information. Regarding discrimination, large language models may inevitably reflect biases inherent in the training data, leading to discriminatory recommendations. This is even more significant in recommender systems with the long-tail effect, where biased user and item distributions can lead to decisions that favor majority choices, resulting in discrimination against certain users. The final concern revolves around ethical considerations: harmful messages, if clicked by users unconsciously, can guide large language models toward generating similar harmful content. When assisting in personalized decision-making, it is essential for large language models to have the capability to minimize exposure to harmful messages and guide users in a responsible manner. Approaches like constructing a Constitutional AI [273], where critiques, revisions, and supervised learning are adopted to better train large language models, may offer valuable insights.

By addressing these concerns, safeguarding privacy, mitigating discrimination, and adhering to ethical guidelines, recommender systems can leverage the power of large language models while ensuring user trust, fairness, and responsible recommendations.

12 CONCLUSION

In conclusion, the emergence of large language models represents a significant breakthrough in the field of artificial intelligence. Their enhanced abilities in understanding, language analysis, and common-sense reasoning have opened up new possibilities for personalization. In this paper, we provide several perspectives on how large language models can be adapted to personalization systems. We have observed a progression from utilizing low-level capabilities of large language models to enhance performance, to leveraging their potential in complex interactions with external tools for end-to-end tasks. This evolution promises to revolutionize the way personalized services are delivered. We also acknowledge the open challenges that come with the integration of large language models into personalization systems.

REFERENCES

[1] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., "A survey of large language models," arXiv preprint arXiv:2303.18223, 2023.
[2] J. Huang and K. C.-C. Chang, "Towards reasoning in large language models: A survey," arXiv preprint arXiv:2212.10403, 2022.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[4] A. Salemi, S. Mysore, M. Bendersky, and H. Zamani, "LaMP: When large language models meet personalization," arXiv preprint arXiv:2304.11406, 2023.
[5] L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen, C. Qin, C. Zhu, H. Zhu, Q. Liu et al., "A survey on large language models for recommendation," arXiv preprint arXiv:2305.19860, 2023.
[6] J. Lin, X. Dai, Y. Xi, W. Liu, B. Chen, X. Li, C. Zhu, H. Guo, Y. Yu, R. Tang et al., "How can recommender systems benefit from large language models: A survey," arXiv preprint arXiv:2306.05817, 2023.
[7] W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang, and Q. Li, "Recommender systems in the era of large language models (LLMs)," arXiv preprint arXiv:2307.02046, 2023.
[8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, "GroupLens: An open architecture for collaborative filtering of netnews," in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, 1994, pp. 175–186.
[9] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang, "One-class collaborative filtering," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 502–511.
[10] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.
[11] J. Wang, A. P. De Vries, and M. J. Reinders, "Unifying user-based and item-based collaborative filtering approaches by similarity fusion," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 501–508.
[12] M. J. Pazzani and D. Billsus, "Content-based recommendation systems," in The Adaptive Web: Methods and Strategies of Web Personalization. Springer, 2007, pp. 325–341.
[13] C. Wang and D. M. Blei, "Collaborative topic modeling for recommending scientific articles," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 448–456.
[14] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma, "Collaborative knowledge base embedding for recommender systems," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 353–362.
[15] H. Liu, F. Wu, W. Wang, X. Wang, P. Jiao, C. Wu, and X. Xie, "NRPA: Neural recommendation with personalized attention," in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1233–1236.
[16] H. Wang, N. Wang, and D.-Y. Yeung, "Collaborative deep learning for recommender systems," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1235–1244.

[17] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, mendation,” in Proceedings of the 25th ACM SIGKDD international
H. Li, and K. Gai, “Deep interest network for click-through rate conference on knowledge discovery & data mining, 2019, pp. 1891–1899.
prediction,” in Proceedings of the 24th ACM SIGKDD international [37] J. Zhao, Z. Zhou, Z. Guan, W. Zhao, W. Ning, G. Qiu, and
conference on knowledge discovery & data mining, 2018, pp. 1059–1068. X. He, “Intentgc: a scalable graph convolution framework fusing
[18] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, heterogeneous information for recommendation,” in Proceedings
and K. Gai, “Deep interest evolution network for click-through of the 25th ACM SIGKDD International Conference on Knowledge
rate prediction,” in Proceedings of the AAAI conference on artificial Discovery & Data Mining, 2019, pp. 2347–2357.
intelligence, vol. 33, no. 01, 2019, pp. 5941–5948. [38] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller,
[19] X. Wang, X. He, M. Wang, F. Feng, and T.-S. Chua, “Neural graph and S. Riedel, “Language models as knowledge bases?” arXiv
collaborative filtering,” in Proceedings of the 42nd international ACM preprint arXiv:1909.01066, 2019.
SIGIR conference on Research and development in Information Retrieval, [39] A. Roberts, C. Raffel, and N. Shazeer, “How much knowledge can
2019, pp. 165–174. you pack into the parameters of a language model?” arXiv preprint
[20] W.-C. Kang and J. McAuley, “Self-attentive sequential recommen- arXiv:2002.08910, 2020.
dation,” in 2018 IEEE international conference on data mining (ICDM). [40] F. Petroni, P. Lewis, A. Piktus, T. Rocktäschel, Y. Wu, A. H. Miller,
IEEE, 2018, pp. 197–206. and S. Riedel, “How context affects language models’ factual
[21] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session- predictions,” in Automated Knowledge Base Construction.
based recommendations with recurrent neural networks,” arXiv [41] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know
preprint arXiv:1511.06939, 2015. what language models know?” Transactions of the Association for
[22] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, Computational Linguistics, vol. 8, pp. 423–438, 2020.
“Bert4rec: Sequential recommendation with bidirectional encoder [42] C. Wang, X. Liu, and D. Song, “Language models are open
representations from transformer,” in Proceedings of the 28th ACM knowledge graphs,” arXiv preprint arXiv:2010.11967, 2020.
international conference on information and knowledge management, [43] N. Poerner, U. Waltinger, and H. Schütze, “E-bert: Efficient-yet-
2019, pp. 1441–1450. effective entity embeddings for bert,” in Findings of the Association
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. for Computational Linguistics: EMNLP 2020, 2020, pp. 803–818.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” [44] B. Heinzerling and K. Inui, “Language models as knowledge bases:
Advances in neural information processing systems, vol. 30, 2017. On entity representations, storage capacity, and paraphrased
[24] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, queries,” in Proceedings of the 16th Conference of the European Chapter
R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling of the Association for Computational Linguistics: Main Volume, 2021,
laws for neural language models,” arXiv preprint arXiv:2001.08361, pp. 1772–1791.
2020. [45] C. Wang, P. Liu, and Y. Zhang, “Can generative pre-trained
[25] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, language models serve as knowledge bases for closed-book qa?”
E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark in Proceedings of the 59th Annual Meeting of the Association for
et al., “Training compute-optimal large language models,” arXiv Computational Linguistics and the 11th International Joint Conference
preprint arXiv:2203.15556, 2022. on Natural Language Processing (Volume 1: Long Papers), 2021, pp.
[26] H. Wang, F. Zhang, X. Xie, and M. Guo, “Dkn: Deep knowledge- 3241–3251.
aware network for news recommendation,” in Proceedings of the [46] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval aug-
2018 world wide web conference, 2018, pp. 1835–1844. mented language model pre-training,” in International conference
[27] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, and E. Y. Chang, “Im- on machine learning. PMLR, 2020, pp. 3929–3938.
proving sequential recommendation with knowledge-enhanced [47] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and
memory networks,” in The 41st international ACM SIGIR conference O. Yakhnenko, “Translating embeddings for modeling multi-
on research & development in information retrieval, 2018, pp. 505–514. relational data,” Advances in neural information processing systems,
[28] H. Wang, F. Zhang, M. Hou, X. Xie, M. Guo, and Q. Liu, vol. 26, 2013.
“Shine: Signed heterogeneous information network embedding [48] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen,
for sentiment link prediction,” in Proceedings of the eleventh ACM and N. Zhang, “Llms for knowledge graph construction and
international conference on web search and data mining, 2018, pp. reasoning: Recent capabilities and future opportunities,” arXiv
592–600. preprint arXiv:2305.13168, 2023.