Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis
LANLING XU and JUNJIE ZHANG, Gaoling School of Artificial Intelligence, Renmin University of China, China
BINGQIAN LI, Beijing Key Laboratory of Big Data Management and Analysis Methods, China
WAYNE XIN ZHAO∗ and JI-RONG WEN, Gaoling School of Artificial Intelligence, Renmin University of China, China
∗Wayne Xin Zhao ([email protected]) is the corresponding author.
Recently, large language models such as ChatGPT have showcased remarkable abilities in solving general tasks, demonstrating the
potential for applications in recommender systems. To assess how effectively LLMs can be used in recommendation tasks, our study
primarily focuses on employing LLMs as recommender systems through prompt engineering. We propose a general framework for
utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders. To conduct our analysis, we formalize
the input of LLMs for recommendation into natural language prompts with two key aspects, and explain how our framework can be
generalized to various recommendation scenarios. As for the use of LLMs as recommenders, we analyze the impact of public availability,
tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of
LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task descriptions, user
interest modeling, candidate items construction and prompting strategies. In each section, we first define and categorize concepts in
line with the existing literature. Then, we propose inspiring research questions followed by experiments to systematically analyze the
impact of different factors on two public datasets. Finally, we summarize promising directions to shed light on future research.
Additional Key Words and Phrases: Large Language Models, Recommender Systems, Empirical Study
Lanling Xu, Junjie Zhang, Bingqian Li, Jinpeng Wang, Mingchen Cai, Wayne Xin Zhao, and Ji-Rong Wen. Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis.
1 INTRODUCTION
In order to alleviate the problem of information overload [31, 76], recommender systems explore the needs of users and
provide them with recommendations based on their historical interactions, which are widely studied in both industry
and academia [23, 28, 29, 84]. Over the past decade, various recommendation algorithms have been proposed to solve
recommendation tasks by capturing the personalized interaction patterns from user behaviors [39, 144]. Despite the
progress of conventional recommenders, their performance is highly dependent on limited training data from a few datasets and domains, which leads to two major drawbacks. On the one hand, traditional models lack general world knowledge beyond interaction sequences: for complex scenarios that require reasoning or planning, existing methods lack the commonsense knowledge to solve such tasks [27, 71, 112, 119]. On the other hand, traditional models cannot truly understand the intentions and preferences of users: the recommendation results lack explainability, and requirements expressed by users in explicit forms such as natural language are difficult to take into account [47, 52, 126].
Recently, Large Language Models (LLMs) such as ChatGPT have demonstrated impressive abilities in solving general
tasks [24, 118], showing their potential in developing next-generation recommender systems. The advantages of
incorporating LLMs into recommendation tasks are two-fold. Firstly, the excellent performance of LLMs in complex
reasoning tasks indicates their rich world knowledge and superior inference ability, which can effectively compensate
for the limited local knowledge of traditional recommenders [1, 75, 85]. Secondly, the language modeling abilities of LLMs
can seamlessly integrate massive textual data, enabling them to extract features beyond IDs and even understand user
preferences explicitly [30, 50]. Therefore, researchers have attempted to leverage LLMs for recommendation tasks.
Typically, there are three ways to employ LLMs to make recommendations: (1) LLMs can serve as the recommender to
make recommendation decisions, encompassing both discriminative and generative recommendations [3, 12, 20, 32, 133].
(2) LLMs can be leveraged to enhance traditional recommendation models by extracting semantic representations of
users and items from text corpora. The extensive semantic information and robust planning capabilities of LLMs are
integrated into traditional models [1, 16, 27, 31, 71, 107, 112, 119]. (3) LLMs are utilized as the recommendation simulator to drive generative agents in the recommendation process, where users and items may be empowered by LLMs to simulate the virtual environment [18, 100, 101, 130, 132]. We mainly focus on the first scenario in this paper.
Considering the gap between the general knowledge from large language models and the domain knowledge from
recommendation models [3, 136], there are two key factors for prompting LLMs as recommenders, i.e., how to select an
LLM as the foundation model and how to construct a prompt as the prompting text. As for LLMs, a growing number of
open-source and closed-source models have emerged, and the same model also has different variants due to settings such
as parameter scales and context lengths [140]. There are notable variations in the performance of different LLMs when
it comes to general language tasks such as generation and reasoning [24, 118]. However, the performance differences of
LLMs in recommendation tasks have not been fully explored. It is worth discussing how to select appropriate LLMs
for specific scenarios and develop corresponding training strategies. As for the prompt, it is an important medium for
interactions between humans and language models, and a well-designed prompt can better stimulate the powerful
capabilities of LLMs [43, 70]. To stimulate the recommendation ability of language models, prompt engineering should
involve not only task description and prompting strategies for general tasks, but also the incorporation of user interest
modeling and the creation of candidate items in recommender systems [17, 70, 125].
Although existing studies have made initial attempts to explore the recommendation capabilities of LLMs like Chat-
GPT [12, 20, 32, 68, 89], and some studies have used paradigms such as fine-tuning and instruction tuning to train
LLMs in the field of recommender systems [2, 3, 133, 141], they focus on exploring the performance of a certain task
rather than constructing a comprehensive framework to formalize the potential applications of LLM-powered
recommender systems. There are also systematic reviews concentrating on the progress of LLMs [140] and surveys of
Table 1. An overview of the primary discoveries presented in our work. We summarize new findings in the second column as “new findings”, and mark findings discussed in existing literature that we verify through experiments as “re-validated findings”.
recommender systems empowered by LLMs [51, 61, 116]. However, previous surveys generally use specific criteria to
classify existing work and introduce them separately. They mainly focus on showcasing related work and summarizing
advantages and limitations, rather than conducting additional experiments to validate existing results and explore new
discoveries. Our work focuses on the ability of LLMs to directly serve as recommenders, aiming to establish a general
framework of Prompting Large Language Models for Recommendation (ProLLM4Rec).
In order to conduct our analysis for ProLLM4Rec, we formalize the input of LLMs for recommendation into natural
language prompts with two key aspects: LLMs and prompts, and explain how our framework can be generalized to
various recommendation scenarios and tasks. As for the use of LLMs as recommenders, we analyze the impact of
the public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation
results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important
components of prompts, i.e., task description, user interest modeling, candidate items construction, and prompting
strategies. Given personalized prompts that include task description and user interest, the LLM selects, generates, or
explains candidate items based on general world knowledge and personalized user profiles. For each module, we first
define and categorize concepts in line with the existing literature. Then, we propose inspiring research questions,
followed by detailed experiments to systematically analyze the impact of different conditions on the recommendation
performance. Based on the empirical analysis, we finally summarize empirical findings for future research.
Our main contributions are summarized as follows:
• We derive a general framework, ProLLM4Rec, that summarizes existing work on utilizing LLMs as foundation models for recommendation and can be generalized to multiple scenarios and tasks with different LLMs and prompts.
• We provide a systematic analysis on leveraging LLMs for recommendation, focusing on two aspects: LLMs and
prompt engineering. The use of LLMs includes analysis of public availability, tuning strategies, model architecture,
parameter scale, and context length. Moreover, prompt engineering consists of discussions on task description, user
interest modeling, candidate items construction and prompting strategies. For each aspect, we first define and describe the relevant concepts, and then provide reference solutions with experiments.
• Extensive experiments on two public datasets yield key findings for recommendation with LLMs. As listed in Table 1, our findings cover experimental settings for each aspect of our proposed framework, providing empirical guidance for evaluating the performance of LLMs on recommendation tasks in future research.
In what follows, we first review the related work in Section 2. In Section 3, we present our proposed general framework
and its instantiation, and introduce overall settings of the following experiments. As the core components of this
paper, we discuss two main aspects of ProLLM4Rec, i.e., LLMs and prompts in Section 4 and Section 5, respectively. For
each aspect, we generalize key factors that affect recommendation results, and conduct corresponding experiments to
summarize empirical findings. At last, Section 6 concludes this paper and sheds light on future directions.
2 RELATED WORK
2.1 Recommender Systems
To tackle the challenge of information overload [76, 93, 127], recommender systems have become pivotal tools for
delivering personalized content to users across various domains. In line with previous studies, recommendation
algorithms aim to derive user preferences and behavioral patterns from their historical interactions. The most common
technique for the interaction-based recommendation is Collaborative Filtering (CF) [86, 90], which recommends items
based on preferences of similar users. Matrix Factorization (MF) [42] is a prevalent approach in collaborative filtering,
and it constructs embedding representations for users and items from the interaction matrix, facilitating the algorithm
to calculate similarity scores efficiently. Furthermore, Neural Collaborative Filtering (NCF) [29], integrating deep
neural networks, replaces the inner product used in MF with a neural architecture, thereby demonstrating better
performance than previous methods. Contemporary advancements in deep neural network architectures have enhanced
the integration of user and item embeddings [66]. For example, since recommendation data can be represented as
graph-structured data, Graph Neural Network (GNN) [117] can be utilized to encode the information of the interaction
graph (nodes consist of users and items), and generate meaningful representations via message propagation and
contrastive learning strategies [28, 64, 104, 115]. As Pre-trained Language Models (PLM) gain prominence, there is a
growing interest in pre-trained large-scale recommendation models powered by PLMs [31, 135, 144]. In addition to
user-item pairs and IDs, content-based recommendation algorithms leverage auxiliary modalities such as textual and
visual information to augment user and item representations in recommendation tasks [80, 110, 127].
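As a concrete illustration of the MF idea above, the following minimal sketch (not taken from any of the cited systems) scores and ranks candidate items by the inner product of factorized embeddings; the embedding dimension and random initialization are illustrative placeholders for learned parameters.

```python
import numpy as np

# Minimal matrix-factorization sketch: users and items share a d-dimensional
# latent space, and the predicted preference is their inner product.
rng = np.random.default_rng(seed=0)
n_users, n_items, d = 100, 500, 32                     # illustrative sizes
user_emb = rng.normal(scale=0.1, size=(n_users, d))    # stand-ins for learned factors
item_emb = rng.normal(scale=0.1, size=(n_items, d))

def score(user_id: int, item_ids: np.ndarray) -> np.ndarray:
    """Predicted preference of one user for a set of candidate items."""
    return item_emb[item_ids] @ user_emb[user_id]

# Rank 20 random candidates for user 0 by predicted score (highest first).
candidates = rng.choice(n_items, size=20, replace=False)
ranked = candidates[np.argsort(-score(0, candidates))]
print(ranked)
```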
2.2 LLMs for Recommender Systems
2.2.1 LLM as Recommendation Model. This paradigm takes the LLM as a recommender system. Employing diverse
strategies like pre-training, fine-tuning, or prompting, LLMs can combine general knowledge with input data to yield per-
sonalized recommendations for users [12, 32, 37]. Due to the variety of recommendation tasks, LLM as recommendation
model can be categorized into two types: discriminative recommendation and generative recommendation.
• Discriminative recommendation instructs LLMs to make recommendation decisions on the given candidate items,
usually focusing on item scoring [19] and re-ranking tasks [12]. For Click-Through Rate (CTR) prediction tasks, Liu et al. [68] designed specific zero-shot and few-shot prompts to evaluate the abilities of LLMs on rating prediction. LLMs were required to assign a score to an item according to the user's previous rating history and the score range given in the prompt, and the results indicated that LLMs can outperform classical rating methods (e.g., MF and MLP) in few-shot conditions [68]. Kang et al. [40] further formulated the rating prediction task as a multi-class classification
and regression task, investigating the influence of model size on recommendation performance. Different from these
methods, Hou et al. [32] structured a re-ranking task, employing in-context learning approaches for LLMs to rank
items in the candidate pool. Previous studies highlighted the sensitivity of LLMs to the sequence of interaction histories
provided in prompts [74], which can be alleviated by strategies such as recency-focused prompting [32].
• Generative recommendation requires LLMs to generate items recommended to users, either from candidate item
lists within prompts or from LLMs with general knowledge [51]. GenRec [37] leveraged the contextual comprehension
ability of LLMs to transform interaction histories into formulated prompts for next-item prediction. To address instances where GenRec might propose items absent from the candidate list, GPT4Rec [48] used the BM25 algorithm to retrieve, from the candidate list, the items most similar to those generated by LLMs. In addition to
top-n recommendations, LLMs can be leveraged for generative tasks such as explainable recommendations [11, 44, 49,
73, 124] and review summarization [21, 69, 108]. Moreover, with the incredible abilities in dialogue comprehension
and communication, LLMs are naturally considered as the backbone of conversational and interactive recommender
systems. ChatRec [20] designed an interactive recommendation framework based on ChatGPT, which can comprehend
requirements of users through multi-turn dialogues and traditional recommendation models. Moreover, RecLLM [18]
combined the dialogue management module with a ranker module and a controllable LLM-based user simulator to
generate synthetic conversations for tuning system modules. Apart from these methods, InteRecAgent [35] employed
LLMs as the brain and recommender models as tools, combining their respective strengths to create an interactive
recommender system [35]. As a conversational recommender system, InteRecAgent enabled traditional recommender
systems to become interactive systems with a natural language interface through the integration of LLMs.
There are mainly two paradigms for adapting LLMs as recommenders, i.e., non-tuning paradigm and tuning paradigm.
• Non-tuning paradigm keeps parameters of LLMs fixed and extracts the general knowledge of LLMs with prompting
strategies. Existing work of non-tuning paradigm focuses on designing appropriate prompts to stimulate recommenda-
tion abilities of LLMs [58, 108, 125]. Liu et al. [68] proposed a prompt construction framework to evaluate the abilities of
ChatGPT on five common recommendation tasks, where each type of prompt contained zero-shot and few-shot versions.
Hou et al. [32] not only used prompts to evaluate abilities of LLMs on sequential recommendation, but also introduced
recency-focused prompting and in-context learning strategies to alleviate order perception and position bias issues of
LLMs. ChatRec [20] and InteRecAgent [35] mentioned above are also within the classic non-tuning paradigm.
• Tuning paradigm aims to update parameters of LLMs to inject recommendation capabilities into LLM itself. The
tuning strategies include fine-tuning [33, 141] and instruction tuning [72, 79]. P5 [21] proposed five types of instructions
targeting different recommendation tasks to fine-tune a T5 [81] model. The instructions were formulated based
on conventional recommendation datasets with designed templates, which equipped LLMs with generation abilities
for unseen prompts or items [21]. InstructRec [133] further designed abundant instructions for tuning, including 39
manually designed templates with preference, intention, task form and context of a user. Compared with these methods,
TallRec [3] used LoRA [33], a parameter-efficient tuning method, to handle the two-stage tuning for LLMs. It was first
fine-tuned on general data of Alpaca [92], and then further fine-tuned with the historical information of users.
Although using LLMs as recommendation models offers a way to exploit the common knowledge of LLMs, several problems remain. Due to the high computational cost [60, 96] and slow inference [54], LLMs struggle to match the efficiency of traditional recommendation methods [20, 32]. Additionally, constraints on input sequence length limit the amount of external information (e.g., candidate item lists) [62], degrading the performance of LLMs in scenarios such as sequential recommendation. Furthermore, since information in recommendation tasks is difficult to express in natural language [63, 126], it is hard to formulate appropriate prompts that make LLMs truly understand what they are required to do, leading to unexpected results.
2.2.2 LLM Improves Recommendation Models. This method mainly utilizes LLMs to generate auxiliary information
to enhance the performance of recommendation models [16, 107, 112], based on the reasoning abilities and common
knowledge. The research on how to improve recommendation models with LLMs can be divided into three categories,
i.e., LLM as feature encoder, LLM for data augmentation and LLM co-optimized with domain-specific models.
• LLM as feature encoder. The representation embeddings of users and items are important factors in classical
recommender systems [28, 84]. LLMs serving as feature encoders can generate related textual data of users and items,
and enrich their representations with semantic information. U-BERT [80] injected user representations with user
review texts, item review texts and domain IDs, augmenting the contextual semantic information in user vectors. Wu et
al. [114], on the other hand, employed language models to generate item representations for news recommendation.
With the development of LLMs and prompting strategies, BDLM [134] fed prompts consisting of interaction and contextual information into LLMs, and obtained top-layer feature embeddings as user and item representations.
• LLM for data augmentation. For this paradigm, LLMs are required to generate auxiliary textual information for
data augmentation [1, 66, 71, 112]. By using prompting or in-context learning strategies, the related knowledge will be
extracted in different textual forms to facilitate recommendation tasks [16, 107, 119]. One form of auxiliary textual
information is summarization or text generation, enabling LLMs to enrich representations of users or items [110]. For
example, Du et al. [16] proposed a job recommendation model which utilized the capability of LLMs for summarization
to extract user information and job requirements. Considering item descriptions and user reviews, KAR [119] extracted
the reasoning knowledge on user preferences and the factual knowledge on items through specifically designed prompts,
while SAGCN [66] utilized a chain-based prompting strategy to generate semantic information. Another form of
using the textual features generated from LLMs is for graph augmentation in the recommendation field. LLMRG [107]
leveraged LLMs to extend nodes in recommendation graphs. The resulting reasoning graph was encoded using GNN,
which served as additional input to enhance sequential models. LLMRec [112] adopted three types of prompts to
generate information for graph augmentation, including implicit feedback, user profile and item attributes.
• LLM co-optimized with domain-specific models. The categories mentioned above mainly focus on the impact of
common knowledge for domain-specific models [110]. However, LLM itself often struggles to handle domain-specific
tasks due to the lack of task-related information [40, 125]. Therefore, some studies conducted experiments to bridge the
gap between LLMs and domain-specific models. BDLM [134] proposed an information sharing module serving as an
information storage mechanism between LLMs and domain-specific models. The user embeddings and item embeddings
stored in the module were updated in turn by the LLM and the domain-specific model, enhancing the performance
of both sides. CoLLM [136] combined LLMs with a collaborative model, which formed collaborative embeddings for
LLM usage. By tuning LLM and collaborative module, CoLLM showed great improvements in both warm and cold-start
scenarios. In conversational recommender systems, approaches such as ChatRec [20] and InteRecAgent [35] considered
LLMs as the backbone, and leveraged traditional recommendation models for candidate item retrieval.
In addition to the context limitation and computational cost of LLMs [96], the paradigm that LLM improves recom-
mendation models also encounters other problems. (1) Although LLMs can enhance offline recommender systems to
avoid online latency, this paradigm also limits the ability of LLMs to model real-time collaborative filtering information,
neglecting the key factor for recommendation [110, 112, 119]. (2) Feature encoding, data augmentation, and collaborative
training inevitably expose the user data to LLMs, which may bring privacy, security and ethical issues [6, 87, 113].
2.2.3 LLM as Recommendation Simulator. Due to the gap between the offline metrics and online performance of recommendation methods [29, 77], it is necessary to capture the intents of users by simulating real-world
elements. In this way, LLM as the recommendation simulator is introduced by taking LLMs as the foundational archi-
tecture of generative agents, and agents simulate the virtual users in the recommendation environment [101, 130, 132].
Recently, much work has emerged studying the performance of LLMs as recommendation simulators. Agent4rec [130] was a movie simulator consisting of two core components: LLM-empowered generative agents and the recommendation environment. The work equipped each agent with user profile, memory and action modules, mapping basic behaviors
of real-world users. AgentCF [132], on the other hand, considered not only users but also items as agents. It captured
the two-sided relations between users and items, and optimized these agents by prompting them to reflect on and adjust
the misleading simulations collaboratively [132]. Moreover, in addition to behaviors within the recommender system,
RecAgent [100, 101] took external influential factors of user agent simulation into account, such as friend chatting and
social advertisement. In order to describe users accurately, RecAgent applied five features for users, and implemented
two global functions including real-human playing and system intervention to operate agents flexibly.
Although LLM as recommendation simulator aims to imitate real-world recommendation behaviors to enhance recommendation performance, it still has deficiencies in some aspects. Firstly, since current work mainly consists of demo systems that operate only a few agents [100, 101], there still exists a gap between the virtual agent environment and real-world practical recommendation applications, which requires further research and development. Additionally, LLMs may raise privacy and safety concerns: many studies take ChatGPT as the architecture of agents, presenting security risks to the recommended information for users [132]. Moreover, Zhang et al. [130] have shown that hallucination in LLMs can exert a huge impact on recommendation simulations. The LLM sometimes fails to accurately simulate human users, e.g., providing inconsistent scores for an item or fabricating non-existent items for rating.
Compared to previous work, we concentrate on the ability of LLMs leveraged for recommendation tasks, and provide a
systematic empirical analysis on prompting LLM-based recommendations by devising a general framework ProLLM4Rec.
We mainly focus on two aspects, i.e., LLMs and prompt engineering, providing definitions and solutions from both
conceptual and methodological perspectives. Furthermore, we conduct experiments to discover new findings and
validate results previously discussed in existing research, serving as an inspiration for future research efforts.
3 GENERAL FRAMEWORK
In this section, we first describe the key elements of our proposed framework ProLLM4Rec (Section 3.1), and then explain how the framework can be generalized to various recommendation scenarios and tasks by framework instantiation (Section 3.2). Finally, we introduce the details of the overall experimental settings for further analysis (Section 3.3).
3.1.1 Key Elements in Our Framework. To carry out the experiments, we first describe the key elements in ProLLM4Rec to clarify their definitions and scope in this work. Specifically, we introduce the following five elements for our framework:
• Large language models. As proposed in previous research [2, 32, 53, 136], there exists a large gap between general
language modeling and personalized user behavioral modeling, making it non-trivial to utilize LLMs in recommender
systems. In this work, we investigate the efficacy of LLMs from perspectives of public availability, tuning strategies,
model architecture, parameter scale, and context length, aiming to gain insights into the selection of appropriate LLMs
for performing recommendation tasks. Observations and discussions on the use of LLMs are presented in Section 4.
• Task description. To adapt LLMs to the scenario of recommendation, it is necessary to clearly express the context
and target of recommender systems for LLMs in the prompt, i.e., task description. With different prompt descriptions of
tasks, large language models can be applied to various recommendation scenarios and tasks such as click-through rate
prediction [2, 3, 40], sequential recommendation [12, 32, 141] and conversational recommender systems [20, 30, 105].
• User interest modeling. The modeling of user interest is the key to recommendation tasks [29, 84]. When leveraging
LLMs for recommendation, users are generally expressed in natural language text, which is different from traditional
approaches capturing user preference from ID-based behavior sequences [39, 115]. In this paper, we mainly consider the user interest reflected by his or her interaction behaviors with items. Specifically, as detailed in Section 5.2, we employ item description texts, user profiles, and historical interactions between users and items to reveal the underlying user interest in natural language [62, 89, 108, 125].
• Candidate items construction. The purpose of recommender systems is to provide users with items to choose from,
so candidate items construction is a crucial step in our framework [12, 32, 133]. A simple approach is to provide several
candidate items in prompts for the LLM, e.g., the items recalled by traditional recommendation models [62, 128]. Due
to the input length limitation of LLMs, it is not possible to include all items in the prompts. In addition to selecting
suitable candidate sets, there are also methods that directly generate candidate items by LLMs, utilizing strategies such
as output probability distribution [128] and vector quantization [141] for item indexing and grounding. Section 5.3 will
focus on the construction strategies of candidate items, including selection and generation.
• Prompting strategies. Despite the impressive capabilities of LLMs, they tend to exhibit unsatisfactory performance
in providing personalized recommendations [12, 32, 40, 68]. The reason may stem from the significant semantic gap
between the general knowledge encoded in LLMs and the domain-specific behavioral pattern and item catalogs of
recommender systems [53, 136]. To specialize LLMs to recommender systems, we summarize and propose several
prompting strategies specialized for recommendation tasks. Details will be discussed in Section 5.4.
Table 2. Instantiations of the ProLLM4Rec framework in existing work, grouped by tuning setting.

| Setting | LLMs | Task description | User interest | Candidate items | Prompting strategies | Related work |
|---|---|---|---|---|---|---|
| Not tuning | ChatGPT, GPT-4 | CTR predictions, rating, re-ranking | recent and relevant items (with attributes), user profile | pointwise, pairwise, listwise item(s) | chain of thoughts, in-context learning, role prompting | [12, 14, 32, 40, 58, 65, 68, 69, 74, 89, 99, 108, 109, 125, 142] |
| Not tuning | ChatGPT, GPT-4 | conversational recommender systems | user explicit interest, interactive feedback | recalled from traditional models | role prompting | [18, 20, 30, 35, 38, 105, 145] |
| Not tuning | ChatGPT, GPT-4 | generative recommendation | recent items (with attributes), user profile | (not provided, generation methods) | basic prompts | [71, 103] |
| (Parameter-efficient) Fine-tuning | LLaMA, LLaMA2, Vicuna, ChatGPT | CTR predictions, rating, re-ranking | recent and relevant items (with attributes), user profile, collaborative embedding | pointwise, listwise item(s) | chain of thoughts, role prompting, soft prompting | [3, 14, 19, 40, 55, 60, 62, 88, 98, 122, 128, 136] |
| (Parameter-efficient) Fine-tuning | LLaMA, LLaMA2, BART, GPT | recall, retrieving | recent and relevant items (with attributes) | (not provided, item grounding methods) | basic prompts | [2, 37, 48, 54, 63, 78, 135, 141] |
| Instruction tuning | (Flan-)T5, LLaMA2 | rating, ranking, retrieving, explanation, summarization | recently interacted items, user profile, short-term intentions | pointwise, pairwise, listwise item(s) | basic prompts | [9, 21, 57, 72, 79, 133] |
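To make the interplay of these five elements concrete, the following is a minimal sketch of how a ProLLM4Rec-style prompt could be assembled from the four prompt components; the class and field names are our own illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class PromptConfig:
    task_description: str          # what the LLM should do (e.g., ranking)
    history_titles: list[str]      # user interest: recently interacted items
    candidate_titles: list[str]    # candidate items construction
    strategy_hint: str = ""        # prompting strategy, e.g., chain of thought

def build_prompt(cfg: PromptConfig) -> str:
    """Concatenate the four prompt components into one natural-language prompt."""
    history = ", ".join(cfg.history_titles)
    candidates = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(cfg.candidate_titles))
    parts = [
        cfg.task_description,
        f"I've watched the following movies in the past in order: {history}.",
        f"Candidate movies:\n{candidates}",
    ]
    if cfg.strategy_hint:
        parts.append(cfg.strategy_hint)
    return "\n".join(parts)

prompt = build_prompt(PromptConfig(
    task_description="Rank the candidate movies by how likely I am to watch them next.",
    history_titles=["Twelve Monkeys", "The Matrix"],
    candidate_titles=["Blade Runner", "Titanic"],
    strategy_hint="Please think step by step.",
))
print(prompt)
```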
3.3.1 Datasets. The domain characteristics of movies and books are close to the general knowledge of LLMs, which facilitates further analysis. Considering the scale, popularity and side information of public datasets, we select two representative datasets for our study, i.e., MovieLens-1M [25] and Amazon Books (2018) [76], as follows.
• MovieLens-1M [25] is one of the most widely used benchmark datasets in the field of recommender systems, covering
movie ratings and attributes on the website movielens.org. We use the one million version from the MovieLens datasets.
• Amazon Books (2018) [76] is an updated version of the Amazon review dataset released in 2014. Amazon initially operated only an online book sales business, so the book domain has the most abundant data. To improve data quality, we filter out inactive users and unpopular products, and remove dirty data without the necessary attributes (a filtering sketch follows this list).
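The following is a minimal sketch of this kind of filtering, assuming a pandas DataFrame of raw interactions with hypothetical column names (`user_id`, `item_id`, `title`); the count threshold is illustrative rather than the paper's exact value.

```python
import pandas as pd

def clean_interactions(df: pd.DataFrame, min_count: int = 5) -> pd.DataFrame:
    """Drop rows missing required attributes, then iteratively remove
    inactive users and unpopular items (k-core style filtering)."""
    df = df.dropna(subset=["user_id", "item_id", "title"])
    while True:
        user_ok = df.groupby("user_id")["item_id"].transform("count") >= min_count
        item_ok = df.groupby("item_id")["user_id"].transform("count") >= min_count
        kept = df[user_ok & item_ok]
        if len(kept) == len(df):       # stable: no more rows to remove
            return kept
        df = kept
```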
In our research, we are concerned with how LLMs can fully utilize domain knowledge to make recommendations, and we use the titles of items as the input for prompts. However, titles alone are not enough to describe items, and there can be deviations between the title text and the actual content of the item (e.g., the movie Twelve Monkeys). Therefore, we further investigate the benefits of detailed item descriptions on the recommendation effect. As shown in Table 3, there are no item descriptions in the original MovieLens dataset, only the release year, title, and genre. To enrich the movie dataset, we use the general knowledge of ChatGPT¹ to generate text descriptions for movies.

¹The URL of the ChatGPT API: https://chat.openai.com/. Note that there are multiple versions of the ChatGPT API; OpenAI released interfaces on March 1 and June 13, 2023, respectively. Unless otherwise noted, the ChatGPT used in this article is “gpt3.5-turbo-4k-0613”.
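A hedged sketch of how such descriptions could be generated with the 2023-era openai Python SDK (openai<1.0); the prompt wording and the exact model string are our assumptions, not necessarily what the authors used.

```python
import openai  # 2023-era SDK (openai<1.0), assumed here

openai.api_key = "YOUR_API_KEY"  # placeholder

def describe_movie(title: str, year: str, genre: str) -> str:
    """Ask ChatGPT to write a short content description for one movie."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",   # assumed to correspond to the paper's version
        messages=[{
            "role": "user",
            "content": f"Write a two-sentence description of the movie "
                       f"'{title}' ({year}, genre: {genre}).",
        }],
        temperature=0.0,              # deterministic output for dataset enrichment
    )
    return resp["choices"][0]["message"]["content"]
```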
3.3.2 Configuration and Implementation. As for ProLLM4Rec, we can directly evaluate the cold-start recommendation ability of large language models in the zero-shot setting, as well as evaluate the fine-tuned ability with few or full recommendation samples in the fine-tuning setting. Considering the typical scenarios of LLMs in recommender systems, we conduct experiments on two representative task settings, i.e., (1) the zero-shot ranking task without modifying parameters of LLMs [12, 20, 32, 74] and (2) the click-through rate prediction task with LLMs tuned [3, 19, 40, 88].
• Zero-shot ranking task. On the one hand, we evaluate the zero-shot recommendation performance of LLMs
for cold-start scenarios to study the effect of LLMs and the design of prompts. In this paper, our approach mainly
concentrates on ranking tasks that better reflect the capabilities of LLMs [12, 20, 32, 74]. As shown in Fig. 1, information
of users and items is encoded into the prompt as inputs for LLMs. In this setting, we do not modify the parameters of
LLMs, so the evaluated models are closed-source large language models or open-source models without fine-tuning. To
conduct experiments on the impact of each factor on recommendation results with LLMs, we implement the overall
architecture based on the open-source recommendation library RecBole [121, 138, 139] and the zero-shot re-ranker
LLMRank [32]. Our basic prompt used in the MovieLens-1M dataset for the zero-shot ranking task is shown in Fig. 2(a), and a programmatic sketch of this template is given after this list.
• Click-through rate prediction task with LLMs tuned. On the other hand, we evaluate the fine-tuned recommen-
dation performance of LLMs to explore how LLMs adapt to recommendation scenarios with data provided [3, 19, 40, 88].
Although our framework can be generalized to various recommendation tasks, we concentrate on exploring the fine-
tuning performance of LLMs with point-wise Click-Through Rate (CTR) prediction tasks to reduce selection bias. In this
setting, we not only consider fine-tuning LLMs using recommendation data, but also devise a two-stage approach of
using instruction data to fine-tune LLMs first, and then implement recommendation fine-tuning for further adaptation.
Specifically, we compare the fine-tuned recommendation performance of the original LLM and the LLM after instruction
tuning, respectively. LLaMA-7B [95] and LLaMA2-7B [96] are the original models, while Alpaca-lora-7B [92] and
LLaMA2-chat-7B [96] are LLMs after instruction tuning. In terms of the tuning strategies of LLMs, we report results
with both parameter-efficient fine-tuning and complete fine-tuning. We implement the fine-tuning framework based on
the open-source library transformers and the instruction tuning code of LLaMA with Stanford Alpaca data². Our instruction data used in the MovieLens-1M dataset for the click-through rate prediction task is illustrated in Fig. 2(b), and a construction sketch is given after this list.

²The repository of Alpaca-LoRA: https://github.com/tloen/alpaca-lora.
Fig. 2. The basic prompts used in the MovieLens-1M dataset for the following experiments.

(a) Zero-shot ranking task [32]:
Input: I've watched the following movies in the past in order: {historical interactions of users}. Note that my most recently watched movie is {recent interacted item}. Now there are 20 candidate movies that I can watch next: {candidate items}. Please rank these 20 movies by measuring the possibilities that I would like to watch next most, according to my watching history. Please think step by step. Please show me your ranking results with order numbers. Split your output with line break. You MUST rank the given candidate movies. You can not generate movies that are not in the given candidate list.
Output: {Candidate items after re-ranking}.

(b) Fine-tuned CTR prediction task [3]:
Instruction: Given the user's preference and unpreference, identify whether the user will like the target movie by answering "Yes." or "No.".
Input: User Preference: {historical items with a rating higher than or equal to the rating threshold}. User Unpreference: {historical items with a rating lower than the rating threshold}. Whether the user will like the target movie {target item}?
Output: "Yes." or "No.".
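To make the template in Fig. 2(a) concrete, the sketch below fills its placeholders programmatically; the function and variable names are our own illustrative choices, not part of the paper's released code.

```python
RANKING_TEMPLATE = (
    "I've watched the following movies in the past in order: {history}. "
    "Note that my most recently watched movie is {recent}. "
    "Now there are 20 candidate movies that I can watch next: {candidates}. "
    "Please rank these 20 movies by measuring the possibilities that I would "
    "like to watch next most, according to my watching history. "
    "Please think step by step. Please show me your ranking results with "
    "order numbers. Split your output with line break. "
    "You MUST rank the given candidate movies. You can not generate movies "
    "that are not in the given candidate list."
)

def ranking_prompt(history: list[str], candidates: list[str]) -> str:
    """Instantiate the Fig. 2(a) template for one user."""
    return RANKING_TEMPLATE.format(
        history=", ".join(history),
        recent=history[-1],            # the most recently watched movie
        candidates=", ".join(candidates),
    )
```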
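Similarly, one rated interaction can be turned into an instruction sample following Fig. 2(b); the sketch below uses the common Alpaca-style instruction/input/output JSON fields, which is an assumption about the data format rather than the paper's exact schema.

```python
def ctr_instruction_sample(liked: list[str], disliked: list[str],
                           target: str, label: bool) -> dict:
    """Build one instruction-tuning sample for the CTR prediction task."""
    return {
        "instruction": ('Given the user\'s preference and unpreference, identify '
                        'whether the user will like the target movie by answering '
                        '"Yes." or "No.".'),
        "input": (f"User Preference: {', '.join(liked)}\n"
                  f"User Unpreference: {', '.join(disliked)}\n"
                  f"Whether the user will like the target movie {target}?"),
        "output": "Yes." if label else "No.",   # label: rating >= threshold
    }
```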
3.3.3 Evaluation Metrics. As for the zero-shot ranking task, considering economic and efficiency factors, we refer to
existing literature [32, 89] to randomly sample 200 users instead of evaluating results on the whole dataset. For each
user from the sample set, we sort all items that the user interacted with in chronological order. Then, we evaluate the results based on the leave-one-out strategy and treat the last interacted item as the ground truth. For performance comparison, we fix the length of the candidate list to 20 as in [32], mixing the ground-truth item with 19 other items in random positions by default. As for evaluation, we utilize two widely used ranking metrics in recommender systems, i.e., Recall [5] and Normalized Discounted Cumulative Gain (NDCG) [36]. Since 20 items are selected as candidates, we set 𝑘 = 20 for Recall@𝑘 to measure whether all ground-truth items have been recalled. In the existing literature, re-generating results until the format requirements are met can be employed to obtain the final response [56], while we consider the recall capability within one inference for fair evaluation. Furthermore, we set 𝑘 = 1, 10, 20 for NDCG to explore the detailed recommendation performance in terms of ranking abilities. To ensure scientific rigor, we repeat each experiment three times and take the average value as the final result.
As for the CTR prediction task, we first sort the original dataset by timestamp, use the latest 10,000 records for training and evaluation, and regard all earlier data as the interaction history of users. Then, we split the ten thousand interactions into training, validation and test sets in a ratio of 8:1:1. For each interaction, we retain 10 historically interacted items as the user representation, and set thresholds on the rating data to obtain user preferences [3]. Interactions with a rating higher than or equal to the threshold are considered items that the user likes, while the opposite indicates dislikes. For the MovieLens-1M dataset, the rating threshold is 4, and we set 5 for the Amazon-Books dataset. We employ the training data to fine-tune LLMs and evaluate the recommendation performance on the test set. The validation set is used for selecting the best checkpoints during training. As for the evaluation of predictions, we utilize the widely used metric for CTR prediction, i.e., accuracy; under random guessing, the accuracy is around 0.5.
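A sketch of the temporal split and rating-threshold labeling described above, assuming a pandas DataFrame with hypothetical `timestamp` and `rating` columns; the helper name is ours.

```python
import pandas as pd

def build_ctr_splits(df: pd.DataFrame, threshold: int, n_latest: int = 10_000):
    """Use the latest n_latest records for train/valid/test (8:1:1);
    earlier records serve as the users' interaction history."""
    df = df.sort_values("timestamp")
    history, latest = df.iloc[:-n_latest], df.iloc[-n_latest:].copy()
    latest["label"] = (latest["rating"] >= threshold).astype(int)  # like vs. dislike
    n = len(latest)
    train = latest.iloc[: int(0.8 * n)]
    valid = latest.iloc[int(0.8 * n): int(0.9 * n)]
    test = latest.iloc[int(0.9 * n):]
    return history, train, valid, test

# MovieLens-1M uses threshold 4; Amazon-Books uses threshold 5.
```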
3.3.4 Discussion on Variable Factors. To conduct our analysis, we mainly focus on two key aspects, i.e., LLMs and
prompts. As for the effects of LLMs, we analyze the impact of public availability, tuning strategies, model architecture,
parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt
engineering, we further analyze the impact of four important components of prompts, i.e., task description, user interest
modeling, candidate items construction and prompting strategies. Since all factors have an impact on the final result, it is also crucial to fix the other aspects when focusing on one aspect. Limited by resources and efficiency, it is neither necessary nor feasible to exhaust all possibilities. Unless explicitly specified otherwise, the LLM is the ChatGPT version released in June 2023, which ensures consistent recommendation quality. For the design of prompts, we refer to the template in [32] and emphasize the most recent items to re-rank the 20 randomly selected candidate items. The default method for modeling user interest is to concatenate the title sequence of recently interacted items in natural language.
4 LARGE LANGUAGE MODELS FOR RECOMMENDATION
LLMs can be divided into open-source and closed-source models in terms of public availability. When it comes to
leveraging LLMs as the foundation model in recommender systems, tuning strategies can adjust LLMs towards specific
recommendation tasks. From the perspective of the model architecture, various LLMs can also be categorized into
types of encoder-decoder, causal decoder, and prefix decoder [140]. For the same LLM framework, it is widely
acknowledged that the parameter scale and context length are two key factors that jointly determine the abilities of
LLMs [32, 62]. To explore the recommendation performance with respect to different variants of LLMs, we focus on
five aspects, i.e., public availability, tuning strategies, model architecture, parameter scale and context length as follows.
4.1.1 Public Availability. According to whether the model checkpoints can be publicly obtained, existing LLMs can be
divided into open-source models and closed-source models, and both can be leveraged as recommender systems.
• Open-source models refer to LLMs whose model checkpoints are publicly accessible. As shown in Table 2,
researchers often use recommendation data to fine-tune open-source models for performance improvement. As the
representative of open-source models, LLaMA [95] and its variants like Vicuna [8] are widely used when leverag-
ing LLMs for recommender systems [54, 141]. Parameter-Efficient Fine-Tuning (PEFT) strategies such as Low-Rank
Adaptation (LoRA) [33] are frequently adopted for recommendation data considering the trade-off between effect and
efficiency [3, 60]. Other models such as the Flan-T5 [10] series from Google Inc. and ChatGLM [129] from Tsinghua
University are also popular in the field of recommender systems. The publicly available checkpoints of open-source
models provide flexibility for LLMs to modify parameters tailored for recommendation tasks.
• Closed-source models refer to LLMs whose model checkpoints are not publicly accessible. For closed-source LLMs utilized as recommenders, researchers generally study the zero-shot recommendation ability in cold-start scenarios. The most typical closed-source model is the ChatGPT series from OpenAI. The subsequent GPT-4 has stronger capabilities than ChatGPT, but it is still not open-source. In this paper, ChatGPT refers to the API version “gpt3.5-turbo-4k-0613” unless otherwise specified. Without checkpoints, OpenAI provides several ways for researchers
and users to improve the model performance on specific tasks, such as plugins for website browsing [45] and interfaces
for fine-tuning ChatGPT [55]. However, the flexibility of closed-source models as recommender systems is still limited
due to the high price and black-box parameters. Faced with this challenge, the existing literature has explored injecting knowledge of recommender systems into closed-source models by means of prompt design [99, 125], retrieval
enhancement [35, 89, 108] and combination of traditional recommendation models [20, 105].
4.1.2 Tuning Strategies. During the deployment of LLMs in recommender systems, we can also classify existing work
depending on whether LLMs are fine-tuned, as well as the various fine-tuning strategies employed, i.e., not-tuning
setting, fine-tuning setting and instruction tuning setting.
• Not-tuning setting means evaluating the zero-shot recommendation ability of LLMs, which is generally used
for closed-source models such as ChatGPT. By designing different prompt templates, LLMs without fine-tuning can
be directly used for recommendation tasks such as click-through rate predictions [14, 40], sequential recommen-
dation [32, 108], and conversational recommender systems [30, 38, 105]. In this case, the user interest is expressed
explicitly (e.g., ratings and reviews) or implicitly (interacted items of users), and the limited candidate items can be
recalled by traditional models, while prompting strategies such as role prompting and chain of thoughts are used.
Inspired by Artificial Intelligence Generated Content (AIGC), it is worth noting that the excellent generation ability of
LLMs provides opportunities for generative recommendation [71, 103]. Without providing candidate items, generative
language models can directly generate the desired items that users need based on recommendation requirements, and
they can also be generalized into our framework ProLLM4Rec as shown in Table 2.
• Fine-tuning setting means using recommendation data to fine-tune LLMs as recommender systems (a LoRA-based sketch is given after this list). Considering cost and efficiency, researchers often use parameter-efficient fine-tuning (e.g., Low-Rank Adaptation of Large Language Models, LoRA [33]) to quickly adapt to recommendation scenarios [3, 60, 141]. As for LLMs, open-source models based on LLaMA [95] are widely used, including but not limited to LLaMA, LLaMA2 and Vicuna [8] with different parameter sizes. Based on whether candidate items are provided, existing work on fine-tuning LLMs can be further divided into two kinds of recommendation tasks. On the one hand, researchers have explored fine-tuning models for recommendation tasks that provide candidate items, such as rating, re-ranking and prediction. Notably, the fine-tuning interface of the closed-source model ChatGPT has brought new breakthroughs to the research of LLMs, and there have been attempts to fine-tune ChatGPT for recommendation tasks [55]. On the other hand, LLMs can also be fine-tuned for the recall stage of recommender systems by retrieving candidates from the whole item pool [2, 54, 63, 78]. Through well-designed indexing, alignment, and retrieval strategies, directly generating recommended items without providing lengthy candidate sequences is more suitable for practical application scenarios, which has not yet been fully explored.
• Instruction tuning setting means providing template instructions of recommenders as prompts to tune targeted LLMs, generally involving multiple recommendation tasks [9, 21, 79, 133]. However, some fine-tuning methods (e.g., TALLRec [3]) also involve the form of instructions. To avoid ambiguity, we consider instructions for a single task template as the fine-tuning setting, and instructions for multiple tasks as the instruction-tuning setting. As the backbone of recommender systems, researchers generally use T5 or Flan-T5 for various recommendation scenarios such as rating, ranking, retrieving, explanation generation and news recommendation. Besides Flan-T5, RecRanker [72] instruction-tunes LLaMA2 as the ranker for top-𝑘 recommendation. By instantiating appropriate instructions for each recommendation task, this setting can also be subsumed under our framework.
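As a concrete reference for the parameter-efficient route above, below is a hedged sketch of wrapping a causal LM with LoRA adapters via the peft library; the checkpoint name, target modules, and hyperparameters are illustrative assumptions rather than the settings used in the cited work.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Inject low-rank adapters into the attention projections; only these small
# matrices are trained, while the 7B base weights stay frozen.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```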
4.1.3 Model Architecture. For the architecture design of LLMs leveraged as recommender systems, we consider
three mainstream architectures as summarized in [140], i.e., encoder-decoder, prefix decoder, and causal decoder. The
recommendation scenarios for each framework are introduced in what follows.
• The encoder-decoder architecture adopts two Transformer blocks as the encoder and decoder, following the vanilla Transformer [97]. In line with existing literature on LLM-based recommender systems, the bidirectional property of the encoder-
decoder architecture allows LLMs to easily customize encoders and decoders towards recommendation (e.g., dual-encoder
considering both ids and text [79]), and conveniently adapt to multiple recommendation tasks [9, 21, 57, 79, 133]. A few
LLMs use the encoder-decoder architecture, and the typical one is the series of T5 and its variant Flan-T5 [10].
• The prefix decoder architecture is also a decoder-only architecture, known as the non-causal decoder. It
can bidirectionally encode the prefix tokens like the encoder-decoder architecture, and perform unidirectional attention
on the generated tokens like the causal decoder architecture. One of the representative LLMs based on the prefix
decoder architecture is ChatGLM [129] and its variants, and researchers have attempted to explore the recommendation
performance of ChatGLM as one of the benchmarks in related work [69].
• The causal decoder architecture has been widely adopted in various LLMs, and the series of GPT [4] and LLaMA [95]
are the most representative models. It uses the unidirectional attention mask, and only decoders are deployed to process
both the input and output tokens. Due to the popularity of the causal decoder architecture, most LLM-based recommender
systems employ this framework to adapt to different recommendation tasks such as click-through rate predictions,
sequential recommendation and conversational recommender systems.
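The three architectures differ mainly in their attention masks; the following minimal sketch contrasts a causal mask with a prefix (non-causal) mask, where entry (i, j) = 1 means position i may attend to position j.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Every token attends only to itself and earlier tokens."""
    return np.tril(np.ones((n, n), dtype=int))

def prefix_mask(n: int, prefix_len: int) -> np.ndarray:
    """Prefix tokens attend bidirectionally; generated tokens stay causal."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = 1   # bidirectional within the prefix
    return mask

print(prefix_mask(5, 3))  # first 3 positions form the bidirectional prefix
```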
4.1.4 Parameter Scale. To meet the diverse needs of different users, LLMs most typically vary in parameter scale. In general, open-source models offer multiple parameter sizes to choose from, and larger parameter sizes generally mean better capabilities [95, 140], but the corresponding computational and memory costs also increase. Considering the memory and efficiency constraints of experiments, researchers in the field of LLM-based recommender systems generally use LLMs with no more than 10B parameters, while the performance of LLMs with larger parameter counts remains to be further explored in the field of recommender systems [40].
4.1.5 Context Length. Another property closely related to user needs is the length of the input context. The inability to handle longer contexts means that decisions cannot draw on the full input, thereby limiting the model capabilities [96, 140]. When the user input exceeds the length limit, the input will be truncated, so a sufficient context length is crucial for the user experience [12, 32, 74]. However, different context lengths imply different model architectures and parameters, and expanding the context length of a model often leads to higher time and memory complexity. To address the length limitation of LLMs, existing methods either selectively discard previous contexts using sliding windows [120], sample only a portion of the context for retrieval augmentation [45, 62], or employ small models without emergent abilities. Despite these recent strategies, the limitation of context length has not yet been truly resolved. Considering economic and efficiency issues, existing LLMs, both open-source and closed-source, only provide a limited number of context length options. In this paper, we mainly focus on several classic lengths, i.e., 2K, 4K, 8K, 16K and 32K (K is the abbreviation for one thousand, similarly hereinafter).
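As a concrete illustration of the sliding-window workaround, here is a minimal sketch that keeps only the most recent history items fitting a token budget; the characters-per-token estimate is a rough assumption standing in for a real tokenizer.

```python
def fit_history(titles: list[str], max_tokens: int = 2048) -> list[str]:
    """Keep the most recent titles that fit the context budget (sliding window).
    Token cost is approximated as len(text) / 4; a real tokenizer would be
    used in practice."""
    kept: list[str] = []
    budget = max_tokens
    for title in reversed(titles):            # walk from newest to oldest
        cost = max(1, len(title) // 4)
        if cost > budget:
            break
        kept.append(title)
        budget -= cost
    return list(reversed(kept))               # restore chronological order
```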
4.2.1 Research Questions. To guide our empirical analysis on LLMs, we study the following research questions:
• RQ1: What are the differences between the recommendation ability of LLMs and traditional recommenders?
• RQ2: How do the different attributes of LLMs, including the public availability, model architectures, parameter
scales, and context lengths affect the recommendation performance and inference time?
• RQ3: What are the similarities and differences in recommendation results of LLMs with different tuning strategies?
Is the LLM after instruction tuning more suitable for recommendation tasks?
4.2.2 Evaluated Models. As for the experimental settings, we consider the following baselines and LLMs.
• Random: The Random baseline ranks the 𝑘 candidate items (k = 20 in this section) in random order, providing the base level for the metric values of each dataset.
• Pop: The Pop method always ranks the candidate items by their number of interactions with users in the training set. We consider it a fully-trained method since it uses the statistical information of the dataset.
• BPR [84]: BPR is a typical traditional model that utilizes matrix factorization for recommendation. It is trained in the pair-wise paradigm, i.e., with the BPR training loss, without considering temporal information.
• SASRec [39]: SASRec is a sequential recommendation model built on the classic self-attention network, a.k.a. the Transformer [97], and achieves competitive performance among sequential models.
• ChatGPT: ChatGPT is a closed-source large-scale pre-trained language model developed by OpenAI. Note that OpenAI released interfaces of ChatGPT on March 1 and June 13, 2023, respectively. For recency, we adopt the June 13 version, the same as for GPT-4.
• GPT-4: GPT-4 is the latest generation of closed-source natural language processing models launched by OpenAI. Experiments have shown that GPT-4 is significantly superior to ChatGPT on multiple tasks.
• Flan-T5 [10]: Flan-T5 is an open-source language model based on the encoder-decoder architecture of T5 released by Google [81]. Flan-T5 extends T5 via a multi-task fine-tuning paradigm, i.e., instruction tuning, to enhance generalization across tasks. There are multiple variants of Flan-T5 in terms of parameters, including Flan-T5-Small (80M), Flan-T5-Base (250M), Flan-T5-Large (780M), Flan-T5-XL (3B) and Flan-T5-XXL (11B). Since the first three models are too small to meet the requirements for LLMs discussed in this paper (at least 1B parameters; B is short for billion, similarly hereinafter), we consider Flan-T5-XL and Flan-T5-XXL for comparison.
• ChatGLM [129]: ChatGLM is an open-source bilingual dialogue LLM that supports both Chinese and English,
based on the General Language Model (GLM) [129] with the prefix decoder architecture. The team released the
second version ChatGLM2 and the third version ChatGLM3 in June 2023 and October 2023, respectively.
• LLaMA [95]: LLaMA is an open-source language model introduced by MetaAI with the causal decoder architecture in four sizes (7B, 13B, 33B and 65B). Due to its outstanding performance and low computational cost, LLaMA has received much attention from researchers, and Vicuna [8] is one of the most popular variants extended from LLaMA. To further improve the abilities of LLaMA, MetaAI released LLaMA2 as the next generation of open-source large language models in July 2023. In addition to the regular version, MetaAI also provides a chatting version of LLaMA2 (i.e., LLaMA2-chat) [96], which is specifically tuned for dialogue scenarios by Reinforcement Learning from Human Feedback (RLHF).
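As a concrete reference for the zero-shot setting, the sketch below shows a minimal inference harness for the open-source models above with the Hugging Face transformers library; the model id and decoding parameters are illustrative assumptions rather than our exact experimental configuration.

```python
# A minimal zero-shot inference harness (illustrative; the model id and
# decoding parameters are assumptions, not our exact experimental setup).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # any causal-decoder LLM from Table 4
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def zero_shot_rank(prompt: str, max_new_tokens: int = 512) -> str:
    """Feed a ranking prompt to the LLM and return its raw textual ranking."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding keeps the ranking reproducible
    )
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```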
• Recommendation performance of LLMs. As shown in Table 4, we provide the fully-trained results of four traditional
methods, as well as the zero-shot recommendation performance of various LLMs. For traditional recommenders, BPR [84]
based on collaborative filtering is significantly better than Pop based on popularity, and the sequential recommendation
model SASRec [39] combined with attention mechanism and temporal information is significantly better than BPR,
which is consistent with the results in the existing literature [39, 84, 144]. It is worth noting that, for the 20 candidate items, LLMs that rely on natural language cannot completely recall all items in most cases, and several models with poor abilities can only output a dozen items, greatly limiting the accuracy of the recommendation results. Therefore, recall@20 indicates the ability of LLMs to memorize, re-rank and output candidate items, and several approaches such as the re-generation method [56] and probability distribution outputs [128] have been proposed to improve the recall performance.
Table 4. Overall performance of different models on recommendation. We consider the fully-trained setting for traditional models and the zero-shot setting for LLMs. Note that there are always ground-truth items among the randomly selected 20 candidates, so the ideal recall@20 equals 1. "IT" stands for "Inference Time", and we record the average inference time per user in seconds (s). "N/A" stands for "Not Applicable", since the inference time of closed-source models is unknown. In each row, the first four metric columns are on MovieLens-1M and the last four on Amazon-Books.

| model | context length | param. size | recall@20 | ndcg@1 | ndcg@10 | IT (s) | recall@20 | ndcg@1 | ndcg@10 | IT (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| *Fully-trained settings for traditional models* | | | | | | | | | | |
| Random | - | 0 | 1.0000 | 0.0300 | 0.2081 | 0.01 | 1.0000 | 0.0350 | 0.2628 | 0.01 |
| Pop | - | 1 | 1.0000 | 0.1800 | 0.4841 | 0.03 | 1.0000 | 0.1000 | 0.2672 | 0.03 |
| BPR | - | <1M | 1.0000 | 0.2550 | 0.5743 | 0.04 | 1.0000 | 0.2950 | 0.6236 | 0.04 |
| SASRec | - | <1M | 1.0000 | 0.6400 | 0.7916 | 1.07 | 1.0000 | 0.6800 | 0.8305 | 1.49 |
| *Zero-shot settings for LLMs: closed-source LLMs* | | | | | | | | | | |
| ChatGPT | 4K | - | 0.9583 | 0.1817 | 0.3985 | N/A | 0.9850 | 0.2467 | 0.4276 | N/A |
| ChatGPT | 16K | - | 0.9600 | 0.1500 | 0.3735 | N/A | 0.9800 | 0.2400 | 0.4032 | N/A |
| GPT-4 | 8K | - | 0.9900 | 0.3100 | 0.5828 | N/A | 1.0000 | 0.3300 | 0.5631 | N/A |
| *Open-source LLMs with the encoder-decoder architecture* | | | | | | | | | | |
| Flan-T5 | 0.5K | 3B | 0.0050 | 0.0000 | 0.0016 | 3.51 | 0.0000 | 0.0000 | 0.0000 | 4.33 |
| Flan-T5 | 0.5K | 11B | 0.0050 | 0.0000 | 0.0016 | 5.21 | 0.0000 | 0.0000 | 0.0000 | 8.28 |
| *Open-source LLMs with the prefix decoder architecture* | | | | | | | | | | |
| ChatGLM | 2K | 6B | 0.7750 | 0.0300 | 0.1945 | 19.12 | 0.7000 | 0.0350 | 0.2026 | 19.52 |
| ChatGLM2 | 32K | 6B | 0.1900 | 0.0450 | 0.0885 | 11.06 | 0.1600 | 0.0250 | 0.0680 | 13.62 |
| ChatGLM3 | 2K | 6B | 0.6900 | 0.0950 | 0.2762 | 7.95 | 0.6550 | 0.0450 | 0.2273 | 10.79 |
| ChatGLM3 | 32K | 6B | 0.7750 | 0.0700 | 0.2579 | 14.06 | 0.7050 | 0.0550 | 0.2068 | 16.89 |
| *Open-source LLMs with the causal decoder architecture* | | | | | | | | | | |
| LLaMA | 2K | 7B | 0.2700 | 0.0350 | 0.1068 | 9.11 | 0.2650 | 0.0200 | 0.0992 | 9.35 |
| LLaMA | 2K | 13B | 0.2500 | 0.0300 | 0.1028 | 9.51 | 0.2250 | 0.0250 | 0.0867 | 9.94 |
| LLaMA | 2K | 33B | 0.3900 | 0.0400 | 0.1328 | 92.88 | 0.2950 | 0.0350 | 0.1015 | 106.61 |
| LLaMA | 2K | 65B | 0.5300 | 0.0450 | 0.1913 | 171.39 | 0.3750 | 0.0500 | 0.1253 | 182.51 |
| Vicuna | 2K | 7B | 0.2650 | 0.0500 | 0.1078 | 8.17 | 0.4400 | 0.0550 | 0.1559 | 11.12 |
| Vicuna | 2K | 13B | 0.4100 | 0.0550 | 0.1507 | 9.26 | 0.4800 | 0.0700 | 0.1813 | 12.72 |
| LLaMA2 | 4K | 7B | 0.2350 | 0.0500 | 0.0888 | 9.97 | 0.5600 | 0.0650 | 0.1648 | 10.01 |
| LLaMA2 | 4K | 13B | 0.4500 | 0.0700 | 0.1215 | 15.64 | 0.5100 | 0.0750 | 0.2150 | 14.44 |
| LLaMA2 | 4K | 70B | 0.7600 | 0.1250 | 0.2918 | 24.38 | 0.6600 | 0.1150 | 0.2912 | 24.63 |
| LLaMA2-chat | 4K | 7B | 0.7950 | 0.0900 | 0.2744 | 9.80 | 0.7050 | 0.1500 | 0.3217 | 9.86 |
| LLaMA2-chat | 4K | 13B | 0.8050 | 0.1650 | 0.3866 | 14.80 | 0.7500 | 0.1650 | 0.3411 | 12.25 |
| LLaMA2-chat | 4K | 70B | 0.9550 | 0.2430 | 0.4344 | 23.07 | 0.8850 | 0.2230 | 0.3827 | 25.01 |
For the traditional models, by contrast, all candidate items can always be recalled (i.e., recall@20 equals 1). Because the candidate items are not recalled completely, the recommendation effect of a few LLMs (e.g., Flan-T5 and ChatGLM)
Fig. 3. The recommendation performance (NDCG@10) of LLMs w.r.t. the number of historical items on (a) MovieLens-1M and (b) Amazon-Books.
is not even as good as the random baseline. For most LLMs, the zero-shot recommendation performance is not as good as the baseline method Pop based on the popularity of interactions in the dataset [12, 32, 68]. However, powerful LLMs like ChatGPT and LLaMA2-70B-chat [96] can achieve better results than Pop in zero-shot settings. Furthermore, GPT-4 can even perform better than the fully-trained matrix factorization model BPR on both datasets, indicating the potential of LLMs to serve as the backbone of recommender systems. In addition, the significant differences between the results of LLMs demonstrate the importance of selecting appropriate LLMs for downstream recommendation tasks [40, 68, 69].
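To make the evaluation protocol concrete, the following minimal sketch computes recall@k and NDCG@k from an LLM ranking that has already been grounded to item ids (the grounding itself is discussed in Section 5.3); it is a reference implementation rather than our exact evaluation code.

```python
import math

def recall_at_k(ranked, ground_truth, k=20):
    """Fraction of ground-truth items appearing in the top-k of the LLM output.
    Items the LLM failed to output at all can never be counted as recalled."""
    return len(set(ranked[:k]) & set(ground_truth)) / len(ground_truth)

def ndcg_at_k(ranked, ground_truth, k=10):
    """NDCG@k with binary relevance, as reported in Table 4."""
    relevant = set(ground_truth)
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(pos + 2)
               for pos in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0
```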
• The impact of historical item sequences. As for the ranking task in Fig. 2(a), the recently interacted historical items are used as the user representation. However, there is no standard value for the number of items that should represent a user. To analyze the impact of historical item sequences on the recommendation performance, we conduct experiments to explore the recommendation effect of the sequential recommender SASRec [39] and the closed-source LLM ChatGPT with different numbers of historical interactions. For the fully-trained SASRec, the maximum length of the historical item sequence affects the model architecture and prediction results [31, 39]. To ensure model uniformity, we fix the checkpoint trained with a maximum historical sequence length of 50 as "SASRec (fixed)" for comparison, and evaluate the recommendation performance (NDCG@10 [36]) with the number of historical items set to 1, 5, 10, 20, 30, 40 and 50, respectively. For ChatGPT and GPT-4, we obtain zero-shot results with different numbers of items to verify whether powerful LLMs can deal with long contexts for recommendation. As illustrated in Fig. 3, with an increasing number of historical items, the results of SASRec improve steadily, while the recommendation performance of LLMs changes little. Consistent conclusions can be drawn on both datasets: even though LLMs can accept more historical items as user representations, increasing the number of historical items does not bring significant gains in recommendation performance. The performance trends of LLMs show that the lengthened historical item sequence is not fully utilized by the language model, indicating the importance of selecting appropriate item sequences to represent users and the inadequacy of LLMs for user interest mining. To improve the mining of user interest for LLMs, approaches such as retrieval augmentation [62] and prompting strategies [108, 125] can be used, which will be analyzed in Section 5.
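The following sketch illustrates how the number of historical items can be varied when constructing the ranking prompt; the template wording follows the spirit of the prompts in Fig. 8 and is hypothetical, not the exact prompt used in our experiments.

```python
def build_ranking_prompt(history_titles, candidate_titles, n_history=10):
    """Represent the user by only the n_history most recent items."""
    recent = history_titles[-n_history:]  # the most recent items come last
    lines = [
        "I've watched the following movies in order: " + ", ".join(recent) + ".",
        f"Note that my most recently watched movie is {recent[-1]}.",
        f"Now there are {len(candidate_titles)} candidate movies that I can watch next:",
        str([f"{i}. {title}" for i, title in enumerate(candidate_titles)]),
        "Please show me your ranking results with order numbers.",
    ]
    return "\n".join(lines)
```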
• Inference time of LLMs. In the actual deployment of recommendation algorithms, inference efficiency is a decisive factor for industrial applications [91, 98]. In general, the inference time of a model is closely related to its parameter size. As shown in the last column of Table 4, for the lightweight traditional recommenders, the inference time of SASRec is about 1 second per user. For LLMs, however, except for the closed-source models whose inference time cannot be accurately obtained due to API limitations, the inference time of open-source models takes nearly 10 or more seconds for one prediction, leading to an unacceptable time delay in practical applications.
• In zero-shot scenarios, LLMs have cold-start capabilities, and GPT-4 even surpasses collaborative filtering models. However, all LLMs are inferior to fully-trained sequential recommendation models.
• Even though LLMs can accept more historical items as user representations, increasing the number of historical items does not bring significant gains in recommendation performance.
• Compared to traditional recommenders, the inference time of LLMs is unacceptable for real applications.
4.3.2 LLMs on Recommendations w.r.t. Four Aspects (RQ2). For different LLMs, differences in public availability and
model architecture will lead to different recommendation scenarios, results and inference time [40, 69]. For the same
LLM, the parameter scale and context length also affect the efficiency and effectiveness of language models [140].
Therefore, we explore the impact of different LLMs on recommendations from four aspects, namely public availability,
model architecture, parameter scale and context length as follows.
• Public availability. As shown in Table 4, closed-source models achieve significantly better results than open-source models in the cold-start scenario, but they cannot outperform fully-trained sequential models. Among the LLMs, ChatGPT in the zero-shot setting has recommendation performance comparable to the fully-trained Pop, especially on the sparse Amazon-Books dataset, indicating the fundamental ability of LLMs on recommendation tasks. Furthermore, the upgraded GPT-4 exceeds ChatGPT by a large margin due to its strong zero-shot generalization ability. The superior zero-shot performance of GPT sheds light on leveraging LLMs for recommendation. However, the open-source models generally obtain poor results compared to GPT-4 in zero-shot settings, although LLaMA2-chat-70B achieves recommendation performance comparable to ChatGPT. The reason is that open-source models lack comprehensive cold-start capabilities, and their strength lies in the ability to integrate domain knowledge through strategies such as prompt tuning. In line with previous studies on LLMs [32, 40, 68, 74], employing a closed-source model in cold-start scenarios yields better results, while an open-source model is more flexible and easy to use when tuning is needed [3, 54, 72, 141].
• Model architecture. For the model architecture, Flan-T5 [10] based on the encoder-decoder architecture has almost no ability to recommend items in the cold-start setting, as its training corpus does not involve specialized instructions for our task. When trained with recommendation prompts, the encoder-decoder architecture is suitable for prompt tuning and instruction tuning [9, 21, 133]. Similarly, the first and second versions of ChatGLM [129] based on the prefix decoder perform poorly on the zero-shot ranking task, and are not as good as Vicuna and LLaMA2 based on the causal decoder. However, the third version, ChatGLM3, has recommendation performance comparable to Vicuna [8] and LLaMA2 [96], which further indicates the importance of selecting an advanced foundation model. In terms of the LLaMA series [95], LLaMA2 is better than Vicuna, and Vicuna is better than LLaMA, which relates to their training data and release times. Furthermore, the chat version of LLaMA2, i.e., LLaMA2-chat, is a series fine-tuned from LLaMA2 with conversational dialogue instructions [96], and is thus more suitable for our ranking settings. Therefore, the results of LLaMA2-chat improve significantly over those of LLaMA2, and LLaMA2-70B-chat even achieves better performance than the closed-source model ChatGPT. Generally speaking, researchers prefer to study recommendation tasks based on the causal decoder framework such as LLaMA, and the second version of LLaMA shows better generalization ability than the first in recommendation tasks.
Fig. 4. The recommendation performance (NDCG@10) and inference time (s) of LLMs w.r.t. the parameter scale on (a) MovieLens-1M and (b) Amazon-Books. A larger point means more parameters.
Fig. 5. The recommendation performance (NDCG@10) of ChatGPT, ChatGLM3 and GPT-4 w.r.t. the context length (in K tokens) on (a) MovieLens-1M and (b) Amazon-Books.
• Parameter scale. It is widely recognized that the larger the parameter size, the more powerful the LLM [32, 40, 140], and the same applies in the field of recommender systems. To compare the effectiveness and efficiency of LLMs w.r.t. the parameter scale, we compare the recommendation performance and inference time of LLaMA [95], Vicuna [8], LLaMA2 [96], and LLaMA2-chat at different parameter scales in Fig. 4. As the parameter scale enlarges, the recommendation performance and inference time of LLMs steadily increase, and both datasets (Fig. 4(a) and Fig. 4(b)) yield consistent conclusions. Therefore, it is necessary to consider the trade-off between performance and efficiency when choosing the parameter scale. Moreover, the performance improvement with increasing scale is more significant for LLaMA2 than for LLaMA, indicating that the scaling effect of LLMs depends on the capabilities of the base model.
• Context length. Different LLMs have different maximum input limitations [95, 96, 140], and a longer context input
means LLMs can accommodate more historical items for recommendation. However, it remains to be explored whether
the maximum input length of LLMs will affect the recommendation results when the length limitation is not exceeded.
Table 5. Overall performance of LLMs on CTR predictions. There are three settings, i.e., zero-shot setting without fine-tuning,
parameter-efficient fine-tuning (PEFT) setting with a few parameters tuned, and fine-tuning (FT) setting with all parameters tuned.
Therefore, we conduct experiments to investigate the differences in the recommendation performance between the two length versions of ChatGPT (4K and 16K) and ChatGLM3 (2K and 32K). As shown in Fig. 5, expanding the length limitation of LLMs does not necessarily bring better recommendation performance; instead, there is a slight decrease in NDCG@10. Furthermore, when the maximum input of LLMs remains unchanged, increasing the historical input of users yields insignificant gains, as shown in Fig. 3. Therefore, the key to the recommendation problem is to enable LLMs to effectively utilize the information within the limited context [62, 125], and a suitable context length for LLMs as recommender systems is worthy of deep consideration.
• As for the public availability, closed-source models outperform the open-source models in terms of the
recommendation performance, but have poorer flexibility.
• As for the model architecture, different frameworks are adapted to different recommendation tasks and fine-tuning strategies, while LLMs with the causal decoder architecture are still mainstream.
• As for the parameter scale, the larger the parameter scale, the better the recommendation ability.
• As for the context length, a longer maximum context length leads to worse recommendation results.
4.3.3 Comparisons of Tuning Strategies for LLMs (RQ3). Due to the fact that LLMs are not customized to recommender
systems during the training process, it is insufficient to only consider the zero-shot recommendation performance in cold-
start scenarios [24, 40, 69, 72]. In order to explore the impact of different training strategies of LLMs on recommendations,
we compare the click-through rate prediction performance of four LLaMA-based LLMs on two datasets. Specifically,
we consider three training settings of LLMs, i.e., the zero-shot setting without fine-tuning, the Parameter-Efficient Fine-Tuning (PEFT) setting (we use LoRA [33] here) with a few parameters tuned, and the Fine-Tuning (FT) setting with all parameters tuned, and summarize empirical conclusions from three aspects: the overall performance of different settings, the impact of instruction tuning, and the impact of training data.
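As a minimal sketch of the PEFT (LoRA) setting with the Hugging Face peft library, the snippet below attaches low-rank adapters to a LLaMA-style base model; the rank, alpha and target modules are illustrative assumptions rather than our exact hyper-parameters.

```python
# A minimal LoRA sketch; hyper-parameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension of the adapters
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# The FT setting instead updates all parameters of `base` directly.
```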
• Overall performance of different settings. As shown in Table 5, the results of fine-tuning LLMs (PEFT and FT) on only 256 samples are significantly better than the zero-shot performance in cold-start scenarios, and the empirical findings are consistent across the four LLMs (LLaMA, Alpaca-LoRA, LLaMA2, LLaMA2-chat) on both datasets. Furthermore, comparing the two fine-tuning strategies, the performance of the full fine-tuning setting is even better than that of the PEFT setting since more parameters are tuned [33]. In addition to recommendation effectiveness, training efficiency also deserves attention. Therefore, we compare the training time of the two fine-tuning strategies on the two datasets. As shown in Fig. 6, the time for parameter-efficient fine-tuning is significantly less than that for fine-tuning all parameters, which agrees with the existing literature [33, 140]. Therefore, the balance between efficiency and effectiveness needs to be further considered.
Fig. 6. The comparison of training time (s) w.r.t. the parameter-efficient fine-tuning (PEFT) and fine-tuning (FT) strategies on (a) MovieLens-1M and (b) Amazon-Books.
Fig. 7. The fine-tuned recommendation performance (NDCG@10) of LLMs w.r.t. the number of sampled data on (a) MovieLens-1M and (b) Amazon-Books, where LoRA is utilized.
• The impact of instruction tuning. Before fine-tuning LLMs on the recommendation data, we are also concerned with whether instruction tuning LLMs on general data can further enhance the subsequent fine-tuning for recommendation. Therefore, we select two pairs of LLMs as controls, namely LLaMA and Alpaca-LoRA, as well as LLaMA2 and LLaMA2-chat. The Alpaca-LoRA model is a tuned version of LLaMA [95] using the LoRA strategy [33] on the Alpaca dataset [92]. Similarly, the LLaMA2-chat model is an updated version of LLaMA2 tuned with the RLHF strategy [96] on conversational data. As illustrated in Table 5 and Fig. 7, the recommendation results of Alpaca-LoRA are better than those of LLaMA, and the performance of LLaMA2-chat outperforms that of LLaMA2, indicating the effectiveness of fine-tuning on general instructions. That is to say, enhancing the general knowledge of language models can also improve the domain capabilities of LLMs on specific recommendation tasks [3].
• The impact of training data. For the few-shot fine-tuning strategies of LLMs, we conduct experiments to explore the impact of the number of training samples on the recommendation performance. As shown in Fig. 7, we visualize the fine-tuning results corresponding to different numbers of training samples with the LoRA strategy, and also consider the traditional recommender SASRec [39] for comparison. Despite only a few sampled data, LLMs such as LLaMA-7B can quickly adapt to the task of CTR predictions.
Fig. 8. Two ranking prompts for GPT-4 (analyzed in Section 4.3.4). Each prompt lists the user's historical movies, notes the most recently watched movie ("Batman Forever" in the left case, "Showgirls" in the right case), presents 20 numbered candidate movies, and asks the model to show its ranking results with order numbers.
LLMs achieve noticeable improvements as the number of training samples increases. With only a small number of training samples, we can already stimulate an appreciable recommendation ability of LLMs, reflecting their remarkable emergent abilities. In contrast, when the training samples increase from 0 to 256, the performance of the traditional recommendation model SASRec remains around 0.5. Experimental results show that LLMs have advantages over conventional recommenders in few-shot learning and adaptation [3]. In addition, the comparison without sampled data in Fig. 7 also highlights the zero-shot cold-start ability of large language models.
• For LLM-based recommendations, few-shot training results are better than the zero-shot performance, and
fine-tuning all parameters is more effective but less efficient than parameter-efficient fine-tuning.
• The instruction tuning using general data can further enhance the fine-tuning results of LLMs.
• In few-shot training scenarios, LLMs are more capable of adapting to recommendation tasks.
4.3.4 Case Study of Limitations (RQ4). Despite the powerful capabilities of LLMs, there are also some limitations in using LLMs as recommender systems. To illustrate these limitations, we choose two recommendation cases of the powerful closed-source model GPT-4 in Fig. 8 to analyze the possible failure scenarios of LLMs.
• Position bias. On the one hand, the generation results of LLMs exhibit randomness, leading to instability in recommendation results, and position bias is a typical manifestation. As shown in the first case of Fig. 8, the LLM places the ground-truth movie "The Mask" at the bottom of the ranking results because "The Mask" is at the end of the candidate item sequence. When the order of candidate items is adjusted, the recommendation results also change, indicating position bias during recommendation. Existing literature has also found position bias in LLM-based recommendations [32, 74, 133], and methods such as bootstrapping [32] and the Bayesian probability framework [74] have been proposed to calibrate unstable results. However, the instability of LLMs is still an unresolved issue.
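As one concrete mitigation, bootstrapping [32] ranks the same candidates under several random permutations and aggregates the results so that no item benefits from its slot in the prompt; the sketch below is a minimal instantiation, where aggregation by average position is one simple choice and `rank_once` is a hypothetical wrapper around a single LLM ranking call.

```python
import random
from collections import defaultdict

def bootstrap_rank(candidates, rank_once, n_rounds=3, seed=0):
    """Rank the candidates n_rounds times under shuffled orders and
    aggregate by average position; `rank_once(items)` wraps one LLM call
    and returns a ranked list (possibly missing some items)."""
    rng = random.Random(seed)
    position_sum = defaultdict(float)
    for _ in range(n_rounds):
        shuffled = candidates[:]
        rng.shuffle(shuffled)
        ranked = rank_once(shuffled)
        for item in candidates:
            # Items the LLM failed to output are penalized with the worst slot.
            pos = ranked.index(item) if item in ranked else len(candidates)
            position_sum[item] += pos
    return sorted(candidates, key=lambda item: position_sum[item])
```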
• Lack of domain knowledge. On the other hand, the lack of domain knowledge in recommendation can lead to misunderstandings by LLMs. From the right example in Fig. 8, it can be seen that the LLM places the second movie in the candidate list, i.e., "Three Colors: White", at the first position of the re-ranking result, which again indicates position bias. Furthermore, the LLM places the ground-truth item "Dick" at the bottom of the ranking list. The possible
reason is that, relying solely on the movie title "Dick", LLMs cannot infer that the movie is a comedy film about two teenage girls, which is highly similar to the movie "Showgirls" recently watched by the user. This case demonstrates that LLMs may make inappropriate decisions due to a lack of domain knowledge in the field of recommender systems.
In addition to the two cases in Fig. 8, LLMs are also limited by several factors such as inference time [2], context length [62], and memory cost [40]. Therefore, exploring the practical application of LLMs through methods such as knowledge distillation [91, 94] has both academic and industrial value.
• Despite the powerful capabilities of LLMs, there are limitations when employing them as recommender systems.
• The generation results of LLMs exhibit randomness, leading to instability such as position bias.
• The lack of domain knowledge in recommendations can lead to misunderstandings by LLMs.
• Point-wise recommendation [7] regards the recommendation task as a binary classification problem, such as the rating scoring task [9, 21] and click-through rate (CTR) prediction [3, 40]. It generates a score or likelihood of preference for a given user-item pair with feature engineering. The recommender system ranks items solely based on the individual characteristics or historical interactions of users, without considering the comparison and mutual influence between items in the input ranking list. When utilizing LLMs for point-wise recommendation, the description mainly involves the user-item information, with a score range or answer list to constrain the output format of LLMs.
• Pair-wise recommendation [84] involves comparing two items in pairs to determine the relative preference for a
particular user. Instead of focusing on individual items, it evaluates pairs of items and calculates the semantic distance
between them. This method often creates item pairs, calculates relative scores, and then ranks or recommends items
based on pair-wise comparisons. The description of pair-wise recommendation in prompts consists of a positive item
and a negative item, instructing LLMs to give the answer from a binary choice list [12].
• List-wise recommendation [59] involves optimizing an entire list of recommended items for a user. Instead of
considering items individually or in pairs, this method treats the entire recommendation list as a single entity and
captures the interior correlations within the list. It aims to create a ranked list of items that collectively maximizes
user satisfaction. For LLM-based list-wise recommendations, the prompt contains a list of items and corresponding
instructions, eliciting the ability of LLMs to explore the potential relationship within the item sequence [32, 74].
• Matching [2, 54] in recommendation involves obtaining a small subset of candidates from the entire item pool for
personalized users, and candidate items are not provided in prompts at this stage. Due to the lack of domain knowledge
in recommender systems, leveraging LLMs for item retrieval requires additional strategies such as model training, item
indexing and grounding approaches [141]. The prompt for matching tasks includes the description of a user in different
forms, e.g., textual profiles and user-item interaction histories, in order to quickly match suitable items.
• Ranking [72] in recommendation involves sorting or ordering items based on the predicted relevance or probability of interest to the target user. It aims to present the most relevant items at the top of the recommendation list. By analyzing historical user-item interactions, user preferences, item features, and the given candidate item list, it requires LLMs to estimate the relevance of items and determine their corresponding positions in the recommendation list [20, 58, 72].
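To make the task descriptions above concrete, the snippet below sketches hypothetical prompt templates for the point-wise, pair-wise and list-wise paradigms; the wording is illustrative and may differ from the exact prompts used in our experiments.

```python
# Hypothetical prompt templates for the three ranking paradigms.
POINT_WISE = (
    "The user has interacted with: {history}.\n"
    "Will the user enjoy \"{item}\"? Answer strictly with Yes or No."
)
PAIR_WISE = (
    "The user has interacted with: {history}.\n"
    "Which item would the user prefer, (A) \"{item_a}\" or (B) \"{item_b}\"?\n"
    "Answer strictly with A or B."
)
LIST_WISE = (
    "The user has interacted with: {history}.\n"
    "Rank the following candidate items from most to least preferred: {candidates}.\n"
    "Output the full ranked list with order numbers."
)

prompt = POINT_WISE.format(history="Toy Story, The Mask", item="Aladdin")
```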
5.2.1 User Interest Type. In the field of recommender systems, there are multiple ways to classify the user interest [23,
84, 144]. For example, based on the type of feedback between users and the platform, the user interest can be divided
into the explicit interest (e.g., ratings and reviews) and implicit interest (e.g., clicks). Although the explicit feedback can
reflect the true intentions of users, the sparse and expensive data limits its application scenarios. In general, we mainly
explore user interest modeling in LLMs under implicit feedback scenarios. Another classification of the user interest is
based on the time duration and stability, i.e., short-term intentions, long-term preferences, and hybrid interest as follows:
• Short-term intentions refer to the sudden and accidental intentions or tendencies of personalized users in recent
interactions, which are prone to change and can be influenced by environmental factors in recommender systems [32, 62].
That is to say, short-term intentions are recent, temporary, and variable [133].
• Long-term preferences refer to the stable preferences of users towards certain content, themes, and elements, which
are not easily changed in the short term and will continue to affect recommendation decisions of users [89, 108]. In
contrast to short-term intentions, long-term preferences are long-term, sustained and stable.
• Hybrid interest refers to the combination of long-term preferences and short-term intentions [18, 65, 142]. On
the one hand, short-term intentions are dominated by long-term preferences [128]. On the other hand, long-term
preferences also consist of the short-term interest across multiple time periods [57, 79].
5.2.2 User Representation Forms. In ProLLM4Rec, user interest modeling should consider both the input form for the
prompt template and the specific storage form for the interest memory [108]. Generally, there are three types of input
contents for interest modeling in LLMs: historical item lists, interest descriptions and user embeddings. Corresponding to
different representation forms of the user interest, there are also various storage forms in the interest memory.
• Historical item lists refer to representing personalized users by the sequence of historically interacted items, which is widely used in session-based and sequential recommendation [31, 39, 144]. For token-based item indexing, the ID sequence of items is used for user modeling in LLMs, similar to traditional sequential recommendation models [21]. However, without fine-tuning, LLMs cannot recognize the meaning of such IDs. For description-based item indexing, existing research generally concatenates the attributes (e.g., titles) of items in temporal order, and then inputs the item sequence to LLMs as textual user representations [32, 46, 142]. Only the item IDs that the user has interacted with need to be stored in the interest memory, and the list of items can be stored in the form of numpy or tensor arrays. Despite the simplicity and effectiveness, textual attributes such as titles cannot fully represent ambiguous items [46]. At the same time, item sequences inevitably contain noisy information [112], and limited sequence lengths make it difficult to accurately express the user interest with pure item lists.
• Interest descriptions refer to representing users with textual descriptions, which is more applicable to the text input of LLMs [125]. That is to say, we can model the user interest with natural language and input it into LLMs [119]. To facilitate the storage and retrieval of textual descriptions, vector stores are commonly used [96]. In LLMs, vector stores in the form of key-value pairs can connect implicit vector embeddings with explicit textual descriptions, improving retrieval efficiency and comprehension ability [54, 78]. However, the key lies in how to mine and describe the user interest, which is where research efforts should be focused.
• User embeddings refer to concatenating embeddings of users to the input of LLMs as user representations [136],
which is often used for efficient fine-tuning of language models. In this case, we assign each user a unique ID and
corresponding embedding, and add the vector representations of users to the input of LLMs for fine-tuning. At the
same time, user embeddings can also be trained from small-scale traditional recommendation models [134], thereby
incorporating collaborative filtering features from other users. However, it is worth noting that vector embeddings of
user representations are only suitable for scenarios where the domain knowledge can be injected into LLMs, and the
black-box property of embeddings also poses challenges to the explainability and understanding of language models.
5.2.3 Modeling Methods. In line with existing literature [100, 140], we classify the methods for modeling interest in
LLMs into three categories, i.e., memory-based methods, retrieval-based methods and generation-based methods.
• Memory-based methods assign external memory to users for storing interest-related historical information [89, 119].
As for user representations, the encoded interest of users can be obtained from the memory [134], and LLMs are instructed
to utilize the memorized interest with carefully designed prompts. It can be seen that the iterative updating of memories
is the key to memory-based methods. As for the personalized memory, there are three important operations [100]: (1) memory reading obtains contents of the interest memory based on the user identifier; (2) memory writing augments the memory with new interest of users based on the latest interactions between users and items; (3) memory reflection is the periodic examination and updating of existing contents in the memory. Strategies for memory reflection include but are not limited to self-summarization, self-correction, and reflection based on user feedback [107]. It is memory reflection that makes the memory not just a stack of historical records but a summary of user interest [125]. Note that memory-based methods specifically refer to obtaining contents from the memory without further processing procedures.
• Retrieval-based methods add a module for personalized interest retrieval on top of the memory-based methods,
utilizing the personalized query and retrieval strategies to obtain user interest from the personalized memory [35, 62, 108].
As for the personalized query, there are generally three ways to form a query for retrieval: (1) candidate items as search
queries [62] for relevant items, (2) recently interacted items as search queries for similar items [132], and (3) the user
profile as search queries for personalized items [108]. For the criteria of retrieval, there are generally multiple trade-offs,
including dimensions such as relevance, recency, and diversity with respect to the user interest [100].
• Generation-based methods utilize the understanding and generation capabilities of LLMs to summarize, infer,
and derive comprehensive user interest [72, 99, 125]. Generally, a combination of memory-based and retrieval-based
methods is required for generation-based methods. For generative language models, the user interest induced by
generative retrieval can further stimulate the prompting ability of LLMs [71, 130]. The generative approach can also be
used for data augmentation, leveraging the general knowledge of LLMs to augment textual descriptions of the user
interest. In current recommendations, LLMs mainly utilize the retrieved relevant items to generate summarized interest
descriptions, which can also serve as a strategy for the memory reflection [89, 108].
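The three memory operations can be summarized in a minimal sketch as follows, where `llm` is assumed to be any callable mapping a prompt string to generated text; the class and method names are hypothetical.

```python
class InterestMemory:
    """A minimal sketch of a personalized interest memory; `llm` is assumed
    to be any callable mapping a prompt string to generated text."""

    def __init__(self, llm):
        self.llm = llm
        self.records = {}  # user_id -> list of interest-related text entries

    def read(self, user_id):
        # (1) Memory reading: fetch stored interest by the user identifier.
        return self.records.get(user_id, [])

    def write(self, user_id, interaction_text):
        # (2) Memory writing: append interest from the latest interaction.
        self.records.setdefault(user_id, []).append(interaction_text)

    def reflect(self, user_id):
        # (3) Memory reflection: periodically compress raw records into a
        # summary of user interest (self-summarization is one strategy).
        summary = self.llm(
            "Summarize this user's preferences from the records below:\n"
            + "\n".join(self.read(user_id))
        )
        self.records[user_id] = [summary]
```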
5.2.4 Research Questions and Observations for Modeling Short-term Intentions. As discussed in Section 5.2.1, the user
interest modeling in prompts for LLMs can be divided into: (1) short-term intentions, and (2) long-term preferences. In
this section, we explore methods for modeling short-term intentions w.r.t. the following two research questions.
• RQ1: How can we select recent and relevant items for modeling short-term intentions?
• RQ2: Do different modeling forms of short-term intentions yield similar recommendation performance?
To analyze how to select recent and relevant items as the short-term intentions, we adopt the retrieval-based strategy
to retrieve items from the long-term interest. Specifically, following the memory-based and retrieval-based strategy, the
personalized query, e.g., candidate items, recently interacted items or the user profile, is encoded into vectors in a unified
form. The most relevant items are retrieved from memory based on semantic similarity. Meanwhile, we set weights
based on the recency of interactions so that the recent items have a higher probability of being retrieved. Finally, the
corresponding retrieved text is obtained based on key-value pairs in the memory. Here, we explore the design of the
personalized query and memory of users for item retrieval, and study the influence of different variants as follows.
• Variants of the personalized query: as for the personalized query, we utilize the interest of users for retrieving
items, and design the following two variants to compare strategies for constructing the query.
– Short-term interest: we utilize the short-term interest for the memory retrieval of the long-term interest. The
summarized text based on 10 recently interacted items is employed as the query for retrieving items.
– Long- and short-term interest: we first summarize the personalized profile of each user based on all items he or she has interacted with, then concatenate the user profile and the summarized text based on the 10 recently interacted items for the memory retrieval of the long-term interest.
• Variants of the memory: as for the personalized memory, we consider two variants of the memory contents.
– Global memory: we utilize descriptions of items in the datasets as the memory. Since all users in a dataset share
the same item space, the memory w.r.t. item descriptions is not the personalized memory but global memory,
indicating that different users store the same content for the same item.
– Personalized memory: for each item that a user interacts with, we instruct LLMs to generate personalized
descriptions based on the rating score or comment text as personalized memories. Therefore, the description of
the same item varies among personalized users, which is different from the global memory.
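A minimal sketch of the retrieval step described above is given below, assuming pre-computed embeddings for the query and the memorized items; the exponential recency weight is one simple instantiation of weighting by interaction recency, and all names are illustrative.

```python
import numpy as np

def retrieve_interest(query_vec, item_vecs, ages, memory_texts,
                      top_k=10, decay=0.99):
    """Score memorized items by cosine similarity to the query, weighted by
    an exponential recency factor, and return the stored texts of the
    top-k items (a key-value lookup into the memory).
    `ages[i]` counts how many interactions ago item i occurred."""
    sims = item_vecs @ query_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    scores = sims * decay ** np.asarray(ages)  # recent items score higher
    top = np.argsort(-scores)[:top_k]
    return [memory_texts[i] for i in top]
```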
When the items utilized for recent interest modeling are fixed, different representation forms of the items may also affect the modeling results. To examine the impact of modeling forms on the short-term interest, we consider the recently interacted items without retrieval methods, and design the following four variants of short-term intentions:
• Recent items: we use titles of the recent 10 items as short-term intentions. It is the default and baseline strategy.
• Recent items + summarized text on recent items: we utilize the least-to-most prompting strategy as mentioned
in Section 5, which summarizes the interest text based on the recently interacted 10 items, then concatenate the
summarized text and the 10 items for modeling short-term intentions.
• Personalized text of recent items: we construct the personalized memory for each user based on the user-item
interactions, and use the personalized descriptions of recently interacted items as short-term intentions.
• Recent items + personalized text of recent items: we employ not only titles of the recent 10 items, but also the
personalized text of recent items to prompt LLMs as the short-term interest.
As shown in Table 6-7, we summarize the following findings of modeling short-term intentions for prompting LLMs.
(1) The retrieval strategies for recent and relevant items (RQ1). First, we conduct an experiment to explore the retrieval strategies for recent and relevant items, and evaluate the selection of the query and memory for item retrieval. As shown in Table 6, we compare four combinations of queries and memories. In terms of the query, combining the long-term and short-term interest yields better results than the short-term interest alone. When selecting items that represent interest for recommendation, the query needs to consider both the recent tendencies and long-term preferences of users. Our results show that the personalized long-term interest of users can have a positive effect on retrieval for interest modeling.
Table 6. Performance comparison on the design of the personalized query and memory for item retrieval.

| query | memory | ndcg@1 (ML-1M) | ndcg@10 (ML-1M) | ndcg@1 (Books) | ndcg@10 (Books) |
|---|---|---|---|---|---|
| short-term interest | global memory | 0.2000 | 0.4295 | 0.2283 | 0.4144 |
| long- and short-term interest | global memory | 0.2200 | 0.4473 | 0.2400 | 0.3980 |
| short-term interest | personalized memory | 0.2400 | 0.4441 | 0.3250 | 0.4636 |
| long- and short-term interest | personalized memory | 0.2400 | 0.4563 | 0.3300 | 0.4712 |
Table 7. Performance comparison on the modeling forms of short-term intentions.

| Short-term interest | ndcg@1 (ML-1M) | ndcg@10 (ML-1M) | ndcg@1 (Books) | ndcg@10 (Books) |
|---|---|---|---|---|
| recent items | 0.1817 | 0.3985 | 0.2467 | 0.4276 |
| recent items + summarized text on recent items | 0.2783 | 0.5099 | 0.2617 | 0.4729 |
| personalized text of recent items | 0.2100 | 0.4380 | 0.2500 | 0.4733 |
| recent items + personalized text of recent items | 0.2350 | 0.4483 | 0.3000 | 0.5076 |
In terms of the memory, the design of the personalized memory with user interest outperforms the global memory with general descriptions. This result is in line with the features of personalized recommender systems [89, 108], indicating the importance of personalization when designing the user interest for LLMs [132].
(2) The impact of different modeling forms of short-term intentions (RQ2). To examine the effectiveness of various modeling forms w.r.t. the short-term interest, we design four variants based on the 10 recently interacted items and report their recommendation performance in Table 7. We can see that the personalized memory provides more information to LLMs than simple titles (comparison between the first and third lines), demonstrating the role of personalized memories for the user interest. On the other hand, we cannot completely replace titles with descriptions, as retaining the titles and adding personalized descriptions performs even better (comparison between the last two lines). Without the personalized memory, the least-to-most prompting strategy that lets LLMs summarize recent items first can provide personalized descriptions and achieve performance improvements (comparison between the first two lines). There is also a difference between the two datasets, i.e., MovieLens-1M and Amazon-Books, in how the modeling forms of the short-term interest affect recommendation abilities. For the short-term intentions on the MovieLens dataset, the summarized text induced by LLMs is better than the personalized memory, while on the Amazon-Books dataset the results are the opposite (comparison between the second and fourth lines). This is because the MovieLens dataset contains only ratings and no review comments, so the personalized memory is relatively weak.
• When selecting recent and relevant items, it is preferable to combine long-term and short-term user profiles with personalized descriptions to construct the personalized query and memory.
• For the short-term interest of users, representing items with titles plus generated personalized descriptions can further improve the effectiveness of LLMs for recommendation.
5.2.5 Research Questions and Observations for Modeling Long-term Preferences. In this section, we conduct experiments
for modeling the long-term preferences of users in ProLLM4Rec. Specifically, we focus on the two research questions.
• RQ1: Do different modeling forms of long-term preferences yield similar recommendation performance?
• RQ2: How can short-term intentions and long-term preferences be combined into the hybrid interest?
To evaluate the recommendation performance of different modeling methods for long-term preferences, we design the
following variants of interest modeling forms to answer the first research question:
• Recent items: it is the basic modeling form of selecting only recent 10 items for interest modeling.
• Summarized user profile + recent items: it concatenates the user profile summarized by LLMs and the recently
interacted 10 items for interest modeling, considering both the long-term and short-term interest.
• Retrieved items from long-term history: it retrieves 10 relevant items from the long-term interest. Since recent items have a higher probability of being retrieved, this variant considers recency as well.
• Retrieved items from long-term history + recent items: it uses both the retrieved 10 items from long-term
history and the recent 10 items as the representation of user interest, considering both the long-term and short-term
interest. It is worth noting that the retrieved items in this variant do not include the latest 10 items.
• Retrieved personalized text from long-term history: it also retrieves 10 relevant items from the long-term
history like “retrieved items from long-term history”, but this variant uses the personalized descriptions of items
from the personalized memory instead of pure titles to represent the user interest.
• Summarized text based on retrieved personalized text: it builds on the retrieved personalized text from long-term history in the last variant by summarizing the 10 retrieved texts into one description. Instead of providing 10 specific items to LLMs, this variant only summarizes the user interest through personalized text.
Due to the positive impact of both long-term and short-term interest on the recommendation results of LLMs, we
further explore different combination approaches of long-term and short-term interest to answer the second research
question. Specifically, we consider the following four variants of the combination strategies.
• Short-term interest only: it only utilizes the recently interacted 10 items as the baseline for comparison.
• Concatenate long- and short-term interest: it combines the long-term and short-term interest by concatenating
them together for user interest modeling.
• Short-term as query for long-term interest: it combines the long-term and short-term interest by utilizing the
recently interacted 10 items as the query for retrieving relevant items from the long-term history.
• User profile and short-term as query for long-term interest: it retrieves recent and relevant items from the
long-term interest, and for the query, both the personalized user profile and short-term interest are utilized.
As shown in Tables 8-9, the key findings on the design of long-term preferences in ProLLM4Rec are summarized as follows.
(1) The impact of different modeling forms of long-term preferences (RQ1). As for the user interest modeling forms in the first research question, Table 8 illustrates the performance comparison of six modeling forms of the long-term interest.
Table 8. Performance comparison on the modeling forms of long-term preferences.

| User interest modeling | ndcg@1 (ML-1M) | ndcg@10 (ML-1M) | ndcg@1 (Books) | ndcg@10 (Books) |
|---|---|---|---|---|
| recent items | 0.1817 | 0.3985 | 0.2467 | 0.4276 |
| summarized user profile + recent items | 0.2000 | 0.4466 | 0.2500 | 0.4925 |
| retrieved items from long-term history | 0.1950 | 0.4343 | 0.2850 | 0.4650 |
| retrieved items from long-term history + recent items | 0.2400 | 0.4441 | 0.3250 | 0.4636 |
| retrieved personalized text from long-term history | 0.2550 | 0.4714 | 0.3150 | 0.5241 |
| summarized text based on retrieved personalized text | 0.2000 | 0.4492 | 0.2500 | 0.4872 |
Table 9. Performance comparison on the combination strategies of long- and short-term interest.

| Combination strategies | ndcg@1 (ML-1M) | ndcg@10 (ML-1M) | ndcg@1 (Books) | ndcg@10 (Books) |
|---|---|---|---|---|
| short-term interest only | 0.1817 | 0.3985 | 0.2467 | 0.4276 |
| concatenate long- and short-term interest | 0.2000 | 0.4466 | 0.2500 | 0.4925 |
| short-term as query for long-term interest | 0.2400 | 0.4441 | 0.3250 | 0.4636 |
| user profile and short-term as query for long-term interest | 0.2400 | 0.4563 | 0.3300 | 0.4712 |
Through the comparison between the first two lines, we can see that long-term preferences of users can improve the modeling effect of the short-term interest. Besides, modeling items retrieved from the long-term interest also improves the recommendation performance compared to only recent items (comparison between the first and third lines), and combining the retrieved items and recent items yields even better results in line four. On the whole, the first four lines in Table 8 indicate the key role of combining the long-term and short-term interest. Furthermore, personalized texts retrieved from the long-term memory greatly improve the recommendation performance compared to only using recent items. However, when we further summarize the retrieved personalized texts into one description, the results deteriorate, indicating that the rich information contained in the original sequence cannot simply be compressed into general descriptions. The last two lines in Table 8 highlight the important role of personalized memories for personalized users.
(2) The combination strategies for the hybrid interest (RQ2). As for the combination strategies of the long-term and short-term interest, Table 9 shows that simply concatenating the long-term and short-term sequences is the most common and simple approach, but not the optimal one. A better combination approach is to use the short-term interest as the query to retrieve the long-term interest, and the empirical analysis shows improved results. The performance comparison of the combination strategies validates the importance of combining the long-term and short-term interest.
• For long-term interest, it is recommended to provide a personalized memory to store user preferences.
• Compared to the simple concatenation, retrieving long-term preferences based on short-term intentions is
a better way for forming hybrid interest descriptions, and more approaches can be further explored.
• Source of Candidate Items. In practical industrial applications, recommendation is a multi-stage process, including
two kinds of typical stages, i.e., the recall stage and the re-ranking stage [2, 23, 84]. The recall stage in recommender
systems is used for preliminary filtering and selection from the entire set of item candidates, while the re-ranking stage
is used for readjusting the selected candidates, which is more suitable for LLMs as the re-ranker [32, 74]. Therefore, the
source of the candidate items to be ranked is crucial for the results of LLMs for recommendation. Popular approaches for selecting candidate items include traditional recommendation models [128] and retrieval algorithms [62], both of which provide a list of potentially recommended items at a comparably coarse granularity.
• Representation of Candidate Items. In the construction of prompts, we need to use historical items to represent users and provide candidate items for LLMs to rank, both of which involve item indexing [21, 63]. However, there is a gap between item representations in LLMs and in recommendation, so indexing items consistently between LLMs and recommender systems becomes the key difficulty of ProLLM4Rec. There are three typical ways to index items: 1) token-based identifiers, 2) description-based identifiers and 3) hybrid identifiers [141]. For token-based identifiers, researchers often use numerical IDs to identify items [21]. Since token-based identifiers are widely utilized in traditional recommendation models, they are naturally treated as an item-indexing way to align LLMs with recommendation tasks. For description-based identifiers, researchers generally use the title or other textual attributes of an item as its identifier, and formalize it into natural language as input [12, 68]. Due to the richness of natural language, description-based identifiers have high readability and flexibility. For hybrid identifiers, considering both collaborative and textual information of users or items, LLMs can perceive token-based and description-based identifiers together [54, 60]. Current work using hybrid identifiers can be divided into different types: some add the item sequence with textual descriptions and ID information at different positions of the prompt [136], while others combine token-based and description-based embeddings for each item in the sequence through concatenation [54].
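The three indexing schemes can be illustrated with a small sketch; the rendering function and the example ids are hypothetical and only demonstrate how the same candidates appear under token-based, description-based and hybrid identifiers.

```python
def render_candidates(items, scheme="description"):
    """Render candidate items under the three indexing schemes; `items` is
    a list of (item_id, title) pairs and the ids are purely illustrative."""
    if scheme == "token":        # token-based: numerical IDs only
        return [str(item_id) for item_id, _ in items]
    if scheme == "description":  # description-based: textual attributes
        return [title for _, title in items]
    if scheme == "hybrid":       # hybrid: ID and title side by side
        return [f"{item_id}: {title}" for item_id, title in items]
    raise ValueError(f"unknown indexing scheme: {scheme}")

render_candidates([(101, "You've Got Mail"), (102, "The Mask")], "hybrid")
# -> ["101: You've Got Mail", "102: The Mask"]
```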
• Grounding to Candidate Items. After constructing candidate items for recommendation, how to ground the output of LLMs to recommendation item lists is a crucial problem. Methods for grounding to candidate items are associated with the different types of recommendation tasks. For discriminative recommendation, which provides candidate item lists in the prompts, it is easy for LLMs to follow instructions and generate output right within the candidate item lists through elaborately designed prompts [60, 142]. For generative recommendation tasks, which do not explicitly provide candidate item lists in prompts but require further processing of the output, recent work indicates effective approaches for mapping the output of LLMs to candidate item lists [141]. For example, GPT4Rec [48] first generates hypothetical "search queries" using the titles of recently interacted items, and then retrieves similar items from the item pool with the BM25 algorithm. E4SRec [54] directly utilizes the nearest-neighbor search between the output of LLMs and the vectors in the item linear projection component for grounding to candidate items. In this case, the recommendation process does not suffer from length limitation problems, since candidate item lists are not included in the input of LLMs.
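As a minimal illustration of grounding free-text output to the candidate set, the sketch below uses fuzzy string matching from the standard library; BM25 retrieval (as in GPT4Rec) or embedding nearest-neighbor search (as in E4SRec) are the stronger alternatives described above.

```python
import difflib

def ground_output(generated_lines, candidate_titles, cutoff=0.6):
    """Map each generated line to its closest candidate title; drop lines
    that match nothing and keep each candidate at most once."""
    ranked, seen = [], set()
    for line in generated_lines:
        match = difflib.get_close_matches(line.strip(), candidate_titles,
                                          n=1, cutoff=cutoff)
        if match and match[0] not in seen:
            ranked.append(match[0])
            seen.add(match[0])
    return ranked
```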
As shown in Figure 1, our framework ProLLM4Rec involves two mapping processes between the recommendation
space and language space: (1) when using LLMs to represent users, items in the recommendation space need to be
used as historical sequences. (2) When re-ranking candidate items, it is required to provide candidate items in the
recommendation space for LLMs. The key issue of semantic space alignment is the indexing and grounding of items.
As for the item indexing, pseudo ID-based sequences and description-based natural language are two typical forms. As for the grounding methods, the generative method lets LLMs output the labels of candidate items directly, and other mapping strategies such as logits distributions and similarity calculation can also be considered [54, 128].
5.3.2 Research Questions and Experimental Setup. In this section, we explore the effect of candidate items on Pro-
LLM4Rec, and provide empirical findings on the following research questions:
• RQ1: Do different representations of candidate items lead to similar recommendation performance?
• RQ2: Given a list of candidates recalled by traditional recommenders, can LLMs improve the recommendation
results by further re-ranking the given candidates?
In this paper, we explore the performance differences between the following two item representation methods:
• ID-based identifiers: we assign serial numbers or indicators (e.g., ABCD) to the candidate items, and instruct LLMs to output only the re-ranked indicators instead of the complete text.
• Description-based identifiers: we instruct LLMs to directly output the complete text (e.g., titles) of candidate items, and the re-ranking results are parsed from the textual output of LLMs.
Moreover, researchers tend to adopt a two-stage recommendation paradigm, in which conventional recommenders
initially recall multiple candidates, and subsequently, LLMs evaluate and rank these candidates to provide enhanced
recommendations. This approach effectively reduces the computational load of LLMs, enabling more efficient deployment.
However, there is still no consensus regarding the effectiveness of LLMs in reranking the candidates retrieved by
traditional recommendation models [32, 82]. In order to address this, we conduct experiments to compare the impact of
LLMs in re-ranking the original recommendation results generated by traditional recommenders. We evaluate three
classical recommenders, including Pop, BPR [84], and SASRec [39], and report the result in Table 11.
5.3.3 Observations and Discussion. As shown in Tables 10-11, the findings on candidate item construction are as follows.
Table 10. Performance comparison on the grounding forms of LLMs w.r.t. candidate items.

| Method | ndcg@1 (ML-1M) | ndcg@10 (ML-1M) | ndcg@20 (ML-1M) | ndcg@1 (Books) | ndcg@10 (Books) | ndcg@20 (Books) |
|---|---|---|---|---|---|---|
| ID-based identifiers | 0.2750 | 0.5423 | 0.5725 | 0.0600 | 0.1473 | 0.1685 |
| Description-based identifiers | 0.3100 | 0.5828 | 0.6043 | 0.1950 | 0.4546 | 0.5079 |
Table 11. Performance comparison on the effects of LLMs w.r.t. the source of candidate items.
Method          MovieLens-1M (ndcg@1 / @10 / @20)    Amazon-Books (ndcg@1 / @10 / @20)
Pop             0.0000 / 0.0071 / 0.0110             0.0000 / 0.0000 / 0.0012
Pop (+ LLM)     0.0000 / 0.0033 / 0.0086             0.0000 / 0.0014 / 0.0014
Impr.           0.00% / -53.52% / -21.82%            0.00% / 1400.00% / 16.67%
BPR             0.0000 / 0.0084 / 0.0135             0.0000 / 0.0064 / 0.0076
BPR (+ LLM)     0.0050 / 0.0146 / 0.0170             0.0050 / 0.0082 / 0.0107
Impr.           5000.00% / 73.81% / 25.93%           5000.00% / 28.13% / 40.79%
SASRec          0.0750 / 0.1600 / 0.1836             0.0450 / 0.1129 / 0.1318
SASRec (+ LLM)  0.0700 / 0.1475 / 0.1736             0.0750 / 0.1242 / 0.1454
Impr.           -6.67% / -7.81% / -5.45%             66.67% / 10.01% / 10.32%
(1) Grounding forms of candidate items (RQ1). Firstly, we focus on the grounding forms of candidate items. As shown in Table 10, on both the movie and book datasets, outputting the complete item name makes it easier to extract and ground recommendation results than outputting a specified identifier. In other words, description-based identifiers outperform ID-based identifiers when grounding candidate items for LLMs [32, 140]. However, whether identifiers are ID-based or description-based, the inference time of grounding strategies based on generative output grows with the number of candidate items. A feasible solution is to use a single prediction to obtain probability scores over all candidate items based on the output logits [128]. Language models are more sensitive to textual output than to pure identifiers, but more advanced indexing strategies and grounding methods can also improve the accuracy of text-to-item mapping [34], which requires continuous exploration in future research [82].
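The logits-based shortcut can be sketched as follows, in the spirit of [128]: a single forward pass scores all candidates by the next-token probability of their index letters, avoiding per-item generation. The model name is a placeholder, and we assume each index letter maps to a single token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A sketch of logits-based grounding: one forward pass yields a probability
# distribution over candidate index letters A, B, C, ...
# The model checkpoint is a placeholder (assumption).
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.eval()

def score_candidates(prompt, num_candidates):
    # The prompt should end right before the answer, e.g., "...Answer: "
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # next-token logits
    letters = [chr(65 + i) for i in range(num_candidates)]
    # Assumes each letter is a single token in this tokenizer.
    letter_ids = [tok.encode(l, add_special_tokens=False)[0] for l in letters]
    return torch.softmax(logits[letter_ids], dim=-1)  # one score per candidate
```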
(2) The evaluation of the re-ranking abilities of LLMs (RQ2). Secondly, we discuss the performance difference between traditional algorithms and the re-ranked results produced by LLMs. As shown in Table 11, LLMs improve the results of the traditional recommendation method BPR on both datasets by a large margin. However, BPR is not specifically designed for sequential recommendation and its original performance is weak. As for the basic method Pop and the typical sequential recommendation model SASRec, the re-ranking effect of LLMs varies between the two datasets. For the MovieLens-1M dataset in the movie domain, our initial prompts cannot instruct LLMs to achieve better results than traditional methods such as Pop and SASRec. A possible reason is that this movie dataset is very dense, so fully-trained collaborative filtering signals matter more than general knowledge about movies. In the Amazon-Books dataset, by contrast, LLM re-ranking further improves the results of Pop and SASRec, possibly because the general knowledge of LLMs about books can effectively adapt to the sparse recommendation data. The overall experimental results indicate that the re-ranking performance of LLMs varies depending on the specific recommendation methods and datasets. Nevertheless, the general knowledge of LLMs can complement the collaborative signals of traditional models, showing potential to improve traditional recommendation algorithms [53, 136].
• Candidate items construction for LLMs is crucial to the final recommendation results.
• Retrieving candidate items with traditional recommendation models first, and then re-ranking them with LLMs,
can further improve the results, but the performance varies across specific methods and datasets.
5.4.1 Prompting Strategies Specialized for Recommender Systems. In general tasks with LLMs, researchers use methods such as Chain-of-Thought (CoT) prompting to stimulate the logical reasoning ability of LLMs for solving complex problems [15, 67]. As for ProLLM4Rec, prompting strategies can likewise be employed to further improve recommendation performance [65, 125]. In addition, owing to the user and item setting of the recommendation task, prompting strategies need to be customized to the specific needs of users in recommender systems. In this paper, we concentrate on typical prompting strategies for recommendation tasks, i.e., zero-shot prompting, few-shot prompting, recency-focused prompting, role prompting, chain-of-thought prompting and the self-prompting strategy, and leave more advanced planning strategies in ProLLM4Rec for future exploration [70].
• Zero-shot prompting is the most basic and common form among the various prompting strategies. Zero-shot prompting directly provides the context information and task description for LLMs without reference examples. The task description and prompt formats vary with different recommendation tasks [12, 68, 108].
• Few-shot prompting, in contrast to zero-shot prompting, provides a few demonstrations in prompts to help LLMs better understand the user intention [4]. For recommendation tasks, the effectiveness of few-shot prompting depends on how well the provided examples match the current interest of individual users. Therefore, it is worth discussing how to provide reliable context for subsequent recommendations [32].
• Recency-focused prompting was first proposed in [32], based on the observation that the next predicted item correlates more strongly with recently interacted items than with other historical items. Therefore, explicitly emphasizing the recently interacted items in prompts is a practical prompting strategy. In our initial prompt, recency-focused prompting is used by default.
• Role prompting refers to assigning a persona to the language model. In the recommendation task, we can cast the LLM in the role of a recommender system that serves users in a targeted manner, adding auxiliary information that describes the recommender role in detail, e.g., in conversational recommender systems [30, 38, 105].
• Chain-of-thought prompting, also known as CoT prompting, is widely applied in reasoning tasks such as question answering and mathematical inference [111]. CoT prompting can elicit the ability of LLMs to solve problems step by step [41], or further decompose the reasoning and analysis process of the task explicitly using the least-to-most prompting strategy [143]. In the field of recommender systems, explicit steps can also be provided manually by researchers to assist in solving recommendation tasks [99]. For ProLLM4Rec, we not only focus on basic CoT prompts, but also explore the role of custom-designed step-by-step prompts.
• Self-prompting strategy means that the knowledge for answering questions can be obtained by prompting LLMs multiple times. LLMs can be asked to generate relevant knowledge, providing necessary information for concepts in the original problem [67]. Meanwhile, sampling randomness allows LLMs to generate multiple inference chains, and self-consistency [106] takes a majority vote over the results of all chains as the final prediction. When applying LLMs to the re-ranking stage of a recommender system, there are many possible orderings of candidate items, and the recommendation results from the same language model with the same prompt can be totally different. Therefore, bootstrapping with multiple trials is a principled way to reduce such bias [32, 56, 74], as sketched below.
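A minimal sketch of such bootstrapping: the LLM is queried several times over shuffled candidate orders, and the rankings are aggregated with a Borda count. llm_rank is a hypothetical call that returns a ranked list of the given candidates.

```python
import random
from collections import defaultdict

# A sketch of bootstrapping with multiple trials: shuffle the candidate
# order per trial to counter position bias, then aggregate rankings.
# `llm_rank` is a hypothetical ranking call (assumption).
def bootstrap_rank(llm_rank, candidates, trials=3, seed=0):
    rng = random.Random(seed)
    scores = defaultdict(float)
    for _ in range(trials):
        shuffled = candidates[:]
        rng.shuffle(shuffled)                        # vary candidate order
        ranking = llm_rank(shuffled)
        for pos, item in enumerate(ranking):
            scores[item] += len(candidates) - pos    # Borda points
    return sorted(candidates, key=lambda c: -scores[c])
```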
5.4.2 Research Questions and Experimental Setup. In this section, we explore the construction and design of prompting strategies for ProLLM4Rec, and provide empirical findings on the following two research questions:
• RQ1: How do different expressions of prompts affect the recommendation performance of LLMs?
• RQ2: How to choose the planning module for stimulating the recommendation ability of LLMs?
As for the strategy of prompt engineering in ProLLM4Rec, we conduct experiments with the following variants of prompts, compared against the basic prompt (a sketch of the two-stage least-to-most variant follows this list):
• Original: it is the basic prompt used in our experiments as illustrated in Figure 2(a).
• w/o CoT (step-by-step): we remove the magical spell “let’s think step by step” in the original prompt.
• w/ CoT (least-to-most): we obtain recommendation results by prompting LLMs twice. First, LLMs are prompted
to summarize the personal preference of users based on the recently interacted item sequence. Then in the second
stage, we concatenate the summarized profile to the original prompt for recommendation.
• w/ ICL (self): we use the last interaction of the same user as the example for In-Context Learning (ICL) [32].
• w/ ICL (others): we retrieve an example from historical sequences of other users for the current user, and add the
demonstration to the original prompt. For the selection of examples, we encode recent items of users by textual
representations, and search for similar user sequences based on the inner product between vectors.
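The two-stage least-to-most variant can be sketched as two successive LLM calls; llm is a hypothetical text-completion call, and the prompt wording is illustrative rather than the exact prompt from our experiments.

```python
# A sketch of the "w/ CoT (least-to-most)" variant: the first call
# summarizes the user's preference, the second call recommends with the
# summary prepended. `llm` is a hypothetical completion call (assumption).
def least_to_most_recommend(llm, recent_titles, candidates):
    # Stage 1: summarize the user's recent preference.
    profile = llm(
        "Summarize this user's preference in one sentence, based on the "
        "items they recently interacted with: " + "; ".join(recent_titles)
    )
    # Stage 2: concatenate the summarized profile to the ranking prompt.
    return llm(
        f"User preference: {profile}\n"
        "Rank the following candidates for this user, most relevant first:\n"
        + "\n".join(candidates)
    )
```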
5.4.3 Observations and Discussion. As shown in Table 12, findings on prompting strategies of LLMs are listed as follows:
Table 12. Performance comparison of different prompting strategies of LLMs.

Method                            MovieLens-1M (ndcg@1 / @10 / @20)    Amazon-Books (ndcg@1 / @10 / @20)
Original                          0.1817 / 0.3985 / 0.4629             0.2467 / 0.4276 / 0.5054
Zero-shot prompting
w/o Recency-focused prompting     0.1550 / 0.3781 / 0.4546             0.2300 / 0.4256 / 0.4978
w/ Role prompting                 0.2750 / 0.4821 / 0.5379             0.2800 / 0.4814 / 0.5489
w/o CoT (step-by-step prompting)  0.2400 / 0.4543 / 0.5193             0.2650 / 0.4439 / 0.5163
w/ CoT (least-to-most prompting)  0.2783 / 0.5099 / 0.5603             0.2617 / 0.4729 / 0.5403
Few-shot prompting
w/ ICL (self)                     0.2000 / 0.4558 / 0.5171             0.2500 / 0.4265 / 0.5055
w/ ICL (others)                   0.1900 / 0.4015 / 0.4684             0.2500 / 0.4039 / 0.4950

(1) The impacts of different expressions of prompts (RQ1). We compare different prompting strategies for LLMs, and report results grouped into zero-shot prompting and few-shot prompting. In Table 12, “w/o” denotes that we remove the related description from the original prompt, and “w/ ” denotes that we add the corresponding instruction to the prompt. As for the zero-shot prompting strategies, removing the recency-focused prompting sentence (w/o recency-focused) largely decreases the recommendation performance, demonstrating the
key role of recently interacted items in recommendation. Although we provide historical items in chronological order based on timestamps, LLMs still need explicit guidance to understand the importance of recent items, indicating that heuristic knowledge from the recommendation field needs to be supplied to LLMs. Meanwhile, adding the role prompting description to the original prompt (w/ role prompting) significantly improves the performance of zero-shot prompting, which shows that role-playing and expert-like prompts can better leverage the capabilities of LLMs in specific fields or tasks [140]. In addition, an important strategy for zero-shot prompting is the chain of thoughts, and we compare the effects of two kinds of CoT prompting. When we remove the basic prompting sentence “let’s think step by step” from the original prompt, the recommendation performance improves, possibly because following step-by-step prompts hinders the extraction of results from textual outputs. This indicates that recommendation tasks may require task-specific problem decomposition rather than general prompts, and the superior results of summarizing recent interest before recommendation (the least-to-most prompting strategy) confirm this finding.
(2) The effects of planning modules of prompts (RQ2). In contrast to zero-shot prompting, the typical representative of few-shot prompting in ProLLM4Rec is in-context learning (ICL) based on contextual examples, and we study ICL results with demonstrations from the current user (self) and from other users (others), respectively. From the results in Table 12, using examples of the target user is better than using demonstrations from other users, reflecting the personalized needs of each user. However, few-shot prompting offers no significant advantage over zero-shot prompting, suggesting that ICL is not fully applicable to recommendation scenarios.
• Even when historical items are given in chronological order, LLMs still need explicit guidance to understand the importance of recent items.
• Role-playing and expert-like prompts can better leverage the capabilities of LLMs in specific fields.
• In chain-of-thought prompting, specific problem decomposition is required for recommendation tasks.
• The few-shot prompting strategy has insignificant advantages in recommendation scenarios.
6 CONCLUSION
This paper aims to provide a comprehensive exploration of Large Language Models (LLMs) to serve as recommender
systems. It presents a systematic review of the advancements made in LLM-based recommendations, generalizing
related work into multiple scenarios and tasks in terms of LLMs and prompts. We also conduct extensive experiments
on two public datasets to investigate empirical findings for recommendation with LLMs. Our objective is to assist
researchers in gaining a deeper understanding of the characteristics, strengths, and limitations of LLMs utilized as
recommender systems. Considering the significant progress in LLMs, the development of LLM-based recommendations
holds the potential to better align the powerful capabilities of LLMs with the evolving needs of intended users in
the field of recommender systems. By addressing current challenges, we hope that our work will contribute to the
advancement of LLM-based recommendations and serve as an inspiration for future research efforts. Last but not least,
we outline promising directions for future research in utilizing LLMs for recommendation as follows.
• Efficiency optimization of LLMs for recommendation. The key limitation of leveraging LLMs in industrial recommender systems is efficiency [17, 51, 116], in terms of both time and space. On the one hand, the fine-tuning and inference efficiency of LLMs cannot match that of traditional recommendation models [32, 141]. Although techniques such as parameter-efficient fine-tuning can help keep LLMs updated in a computationally efficient manner, recommender systems need to iterate continuously over time (i.e., incremental learning), and frequent updates of LLMs inevitably impose spatial and temporal burdens on recommender systems [88]. On the other hand, the billions of parameters in LLMs also pose challenges for the lightweight deployment of recommendation algorithms [95, 96]. Therefore, efficiency optimization of LLMs utilized as recommender systems is a prerequisite for large-scale applications, with widespread application prospects and scientific research value [35, 88, 98].
• Knowledge distillation of LLMs for recommendation. Since LLMs as recommenders are limited by efficiency, another feasible approach is to distill [53, 91] the recommendation capabilities of LLMs into lightweight models, striking a balance between efficiency and effectiveness. Specifically, knowledge distillation is a classic model compression method adopted in recommender systems [91, 94], whose core idea is to guide lightweight student models to “imitate” teacher models with better performance and more complex structures, such as LLMs. In the field of recommender systems, the collaborative optimization of LLMs and recommendation models can also be seen as a distillation process that injects knowledge from LLMs into traditional recommenders, enhancing the representations of users and items with semantic features [53, 79, 119]. Because knowledge distillation can improve efficiency while retaining the recommendation capabilities of LLMs, its applications deserve fuller exploration.
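As a minimal sketch of this idea, illustrating the general recipe rather than any specific cited method, the student recommender's candidate scores can be pushed toward a softened distribution derived from LLM (teacher) scores:

```python
import torch
import torch.nn.functional as F

# A sketch of a distillation objective: KL divergence between the student's
# candidate-score distribution and a temperature-softened teacher
# distribution derived from LLM scores. This illustrates the general recipe,
# not a method from the cited works (assumption).
def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    # Both tensors have shape (batch, num_candidates).
    t = F.softmax(teacher_scores / temperature, dim=-1)     # soft targets
    log_s = F.log_softmax(student_scores / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_s, t, reduction="batchmean") * temperature ** 2
```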
• Multimodal recommendations with LLMs. In addition to IDs and text, multimodal recommendations with
LLMs hold considerable promise and warrant comprehensive exploration with the evolving landscape of media
consumption [123, 137]. The essence of multimodal recommendations resides in the fusion of textual and visual
information for enhanced user engagement [26, 77], and the dual functionality of LLMs is pivotal in this context. LLMs
possess the capability to function as multimodal LLMs, enabling the incorporation and encoding of visual information
extracted from images. Furthermore, images can be transformed into textual representations by multimodal encoders
first, and LLMs are mainly used for the subsequent integration of diverse modalities [26]. In addition, the rich multimodal
attributes also provide a basis for diversified recommendation results [63]. As the field progresses, an emphasis on
the reproducibility, benchmarking, and standardization of evaluation datasets and metrics will be essential to foster a
cohesive and informed advancement in multimodal recommender systems leveraging LLMs.
• General-purpose LLMs in the vertical field of recommender systems. Developing a comprehensive framework
for LLMs to address multiple recommendation tasks represents a significant avenue for scholarly exploration [9, 21, 108,
133]. The endeavor involves formulating a unified and versatile structure that accommodates the intricacies of various
recommendation tasks [12], encompassing diverse modalities and user preferences [26, 63]. In addition, using agents of
LLMs to simulate recommendation scenarios and make dynamic decisions themselves is also a feasible application of
general-purpose LLMs [100, 101, 132]. Research efforts should focus on refining the architecture, training methodolo-
gies, and adaptability of such a general framework to ensure optimal performance across different recommendation
domains. Investigating transfer learning techniques within this framework, enabling the transfer of knowledge between
recommendation tasks, is crucial for enhancing efficiency and leveraging shared information [3, 133]. In general,
general-purpose LLMs for recommendation have the potential to revolutionize recommender systems by providing a
unified and scalable solution capable of addressing the multifaceted challenges posed by diverse recommendation tasks.
• Privacy and ethical concerns. Prior studies [87, 113] have highlighted the potential issue of language models generating unreliable or personal content given certain prompts and insecure instructions. Recommender systems involve massive amounts of user data [23, 76], and it is crucial to remove private and potentially harmful information stored in LLMs to enhance the privacy and security of LLM-based applications [6, 44]. Notably, researchers have observed that Reinforcement Learning from Human Feedback (RLHF) and model editing techniques [22] have the potential to restrain the generation of toxic or harmful content by LLMs, thereby mitigating the privacy and ethical concerns of privacy-preserving recommender systems [6]. Nevertheless, it is imperative to acknowledge
that the alignment technology of LLMs may be vulnerable to misuse. The apprehension exists that LLMs could be
manipulated by malicious users to selectively influence agents and amplify specific viewpoints within recommender
systems. Consequently, addressing the security and privacy implications of LLM-based recommendations is crucial, and
there is a need to formulate public regulations to ensure responsible usage and mitigate potential risks [113].
ACKNOWLEDGEMENTS
The authors would like to thank Chuyuan Wang and Chenrui Zhang for participating in discussions of this paper.
Lanling Xu is supported by Meituan Group during her research internship. Xin Zhao is the corresponding author.
REFERENCES
[1] Saurabh Agrawal, John Trenkle, and Jaya Kawale. 2023. Beyond Labels: Leveraging Deep Learning and LLMs for Content Metadata. In RecSys.
ACM, 1.
[2] Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yancheng Luo, Fuli Feng, Xiangnan He, and Qi Tian. 2023. A bi-step grounding
paradigm for large language models in recommendation systems. arXiv preprint arXiv:2308.08434 (2023).
[3] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. TALLRec: An Effective and Efficient Tuning Framework to
Align Large Language Model with Recommendation. In RecSys. ACM, 1007–1014.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[5] Michael Buckland and Fredric Gey. 1994. The relationship between recall and precision. Journal of the American society for information science 45, 1
(1994), 12–19.
[6] Aldo Gael Carranza, Rezsa Farahani, Natalia Ponomareva, Alex Kurakin, Matthew Jagielski, and Milad Nasr. 2023. Privacy-Preserving Recommender
Systems with Synthetic Query Generation using Differentially Private Large Language Models. arXiv preprint arXiv:2305.05973 (2023).
[7] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa
Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems.
7–10.
[8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez,
Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-
03-30-vicuna/
[9] Zhixuan Chu, Hongyan Hao, Xin Ouyang, Simeng Wang, Yan Wang, Yue Shen, Jinjie Gu, Qing Cui, Longfei Li, Siqiao Xue, et al. 2023. Leveraging
large language models for pre-trained recommender systems. arXiv preprint arXiv:2308.10837 (2023).
[10] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma,
et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
[11] Anthony Colas, Jun Araki, Zhengyu Zhou, Bingqing Wang, and Zhe Feng. 2023. Knowledge-grounded Natural Language Recommendation
Explanation. arXiv preprint arXiv:2308.15813 (2023).
[12] Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT’s
Capabilities in Recommender Systems. In RecSys. ACM, 1126–1132.
[13] Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, and Jun Xu. 2023. Llms may dominate information access:
Neural retrievers are biased towards llm-generated texts. arXiv preprint arXiv:2310.20501 (2023).
[14] Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia, and Eugenio Di Sciascio. 2023. Evaluating
ChatGPT as a Recommender System: A Rigorous Approach. arXiv preprint arXiv:2309.03613 (2023).
[15] Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv preprint
arXiv:2302.12246 (2023).
[16] Yingpeng Du, Di Luo, Rui Yan, Hongzhi Liu, Yang Song, Hengshu Zhu, and Jie Zhang. 2023. Enhancing Job Recommendation through LLM-based
Generative Adversarial Networks. arXiv preprint arXiv:2307.10747 (2023).
[17] Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li. 2023. Recommender systems in the era of
large language models (llms). arXiv preprint arXiv:2307.02046 (2023).
[18] Luke Friedman, Sameer Ahuja, David Allen, Terry Tan, Hakim Sidahmed, Changbo Long, Jun Xie, Gabriel Schubiner, Ajay Patel, Harsh Lara, et al.
2023. Leveraging Large Language Models in Conversational Recommender Systems. arXiv preprint arXiv:2305.07961 (2023).
[19] Zichuan Fu, Xiangyang Li, Chuhan Wu, Yichao Wang, Kuicai Dong, Xiangyu Zhao, Mengchen Zhao, Huifeng Guo, and Ruiming Tang. 2023. A
Unified Framework for Multi-Domain CTR Prediction via Large Language Models. arXiv:2312.10743 [cs.IR]
[20] Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-rec: Towards interactive and explainable
llms-augmented recommender system. arXiv preprint arXiv:2303.14524 (2023).
[21] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain,
personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
[22] Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in
the Vocabulary Space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 30–45.
[23] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR
prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
[24] Abid Haleem, Mohd Javaid, and Ravi Pratap Singh. 2022. An era of ChatGPT as a significant futuristic support tool: A study on features, abilities,
and challenges. BenchCouncil transactions on benchmarks, standards and evaluations 2, 4 (2022), 100089.
[25] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems
(TiiS) 5, 4 (2015), 1–19.
[26] Rachel M Harrison, Anton Dereventsov, and Anton Bibin. 2023. Zero-Shot Recommendations with Pre-Trained Large Language Models for
Multimodal Nudging. arXiv preprint arXiv:2309.01026 (2023).
[27] Jesse Harte, Wouter Zorgdrager, Panos Louridas, Asterios Katsifodimos, Dietmar Jannach, and Marios Fragkoulis. 2023. Leveraging Large Language
Models for Sequential Recommendation. In RecSys. ACM, 1096–1102.
[28] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution
network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval.
639–648.
[29] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th
international conference on world wide web. 173–182.
[30] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian J. McAuley.
2023. Large Language Models as Zero-Shot Conversational Recommenders. In CIKM. ACM, 720–730.
[31] Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for
recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
[32] Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023. Large language models are zero-shot
rankers for recommender systems. arXiv preprint arXiv:2305.08845 (2023).
[33] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation
of large language models. arXiv preprint arXiv:2106.09685 (2021).
[34] Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023. How to Index Item IDs for Recommendation Foundation Models. arXiv
preprint arXiv:2305.06569 (2023).
[35] Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, and Xing Xie. 2023. Recommender AI Agent: Integrating Large Language Models for
Interactive Recommendations. arXiv preprint arXiv:2308.16505 (2023).
[36] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422–446.
[37] Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, and Yongfeng Zhang. 2023. Genrec: Large language model for
generative recommendation. arXiv e-prints (2023), arXiv–2307.
[38] Jiarui Jin, Xianyu Chen, Fanghua Ye, Mengyue Yang, Yue Feng, Weinan Zhang, Yong Yu, and Jun Wang. 2023. Lending Interaction Wings to
Recommender Systems with Conversational Agents. arXiv preprint arXiv:2310.04230 (2023).
[39] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining
(ICDM). IEEE, 197–206.
[40] Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs
Understand User Preferences? Evaluating LLMs On User Rating Prediction. arXiv preprint arXiv:2305.06474 (2023).
[41] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners.
Advances in neural information processing systems 35 (2022), 22199–22213.
[42] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[43] Teven Le Scao and Alexander M Rush. 2021. How many data points is a prompt worth?. In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies. 2627–2636.
[44] Yuxuan Lei, Jianxun Lian, Jing Yao, Xu Huang, Defu Lian, and Xing Xie. 2023. RecExplainer: Aligning Large Language Models for Recommendation
Model Interpretability. arXiv preprint arXiv:2311.10947 (2023).
[45] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33
(2020), 9459–9474.
[46] Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023. Text Is All You Need: Learning Language
Representations for Sequential Recommendation. arXiv preprint arXiv:2305.13731 (2023).
[47] Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian J. McAuley. 2023. Text Is All You Need: Learning Language
Representations for Sequential Recommendation. In KDD. ACM, 1258–1267.
[48] Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni. 2023. GPT4Rec: A generative framework for personalized
recommendation and user interests interpretation. arXiv preprint arXiv:2304.03879 (2023).
[49] Lei Li, Yongfeng Zhang, and Li Chen. 2023. Personalized prompt learning for explainable recommendation. ACM Transactions on Information
Systems 41, 4 (2023), 1–26.
[50] Lei Li, Yongfeng Zhang, and Li Chen. 2023. Prompt Distillation for Efficient LLM-based Recommendation. In CIKM. ACM, 1348–1357.
[51] Lei Li, Yongfeng Zhang, Dugang Liu, and Li Chen. 2023. Large Language Models for Generative Recommendation: A Survey and Visionary
Discussions. arXiv preprint arXiv:2309.01157 (2023).
[52] Ruyu Li, Wenhao Deng, Yu Cheng, Zheng Yuan, Jiaqi Zhang, and Fajie Yuan. 2023. Exploring the Upper Limits of Text-Based Collaborative Filtering
Using Large Language Models: Discoveries and Insights. arXiv preprint arXiv:2305.11700 (2023).
[53] Xiangyang Li, Bo Chen, Lu Hou, and Ruiming Tang. 2023. CTRL: Connect Tabular and Language Model for CTR Prediction. arXiv preprint
arXiv:2306.02841 (2023).
[54] Xinhang Li, Chong Chen, Xiangyu Zhao, Yong Zhang, and Chunxiao Xing. 2023. E4SRec: An Elegant Effective Efficient Extensible Solution of
Large Language Models for Sequential Recommendation. arXiv:2312.02443 [cs.IR]
[55] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023. Exploring Fine-tuning ChatGPT for News Recommendation. arXiv preprint
arXiv:2311.05850 (2023).
[56] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023. Exploring Fine-tuning ChatGPT for News Recommendation. arXiv:2311.05850 [cs.IR]
[57] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023. PBNR: Prompt-based News Recommender System. arXiv preprint arXiv:2304.07862
(2023).
[58] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023. A Preliminary Study of ChatGPT on News Recommendation: Personalization, Provider
Fairness, Fake News. arXiv preprint arXiv:2306.10702 (2023).
[59] Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of
the 2018 world wide web conference. 689–698.
[60] Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2023. LLaRA: Aligning Large Language Models
with Sequential Recommenders. arXiv:2312.02445 [cs.IR]
[61] Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, et al. 2023. How Can
Recommender Systems Benefit from Large Language Models: A Survey. arXiv preprint arXiv:2306.05817 (2023).
[62] Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023. ReLLa:
Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation. arXiv preprint arXiv:2308.11131
(2023).
[63] Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2023. A Multi-facet Paradigm to Bridge Large Language Model
and Recommendation. arXiv preprint arXiv:2310.06491 (2023).
[64] Zihan Lin, Changxin Tian, Yupeng Hou, and Wayne Xin Zhao. 2022. Improving graph collaborative filtering with neighborhood-enriched
contrastive learning. In Proceedings of the ACM Web Conference 2022. 2320–2329.
[65] Dairui Liu, Boming Yang, Honghui Du, Derek Greene, Aonghus Lawlor, Ruihai Dong, and Irene Li. 2023. RecPrompt: A Prompt Tuning Framework
for News Recommendation Using Large Language Models. arXiv:2312.10463 [cs.IR]
[66] Fan Liu, Yaqi Liu, Zhiyong Cheng, Liqiang Nie, and Mohan Kankanhalli. 2023. Understanding Before Recommendation: Semantic Aspect-Aware
Review Exploitation via Large Language Models. arXiv:2312.16275 [cs.IR]
[67] Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022. Generated Knowledge
Prompting for Commonsense Reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). 3154–3169.
[68] Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is chatgpt a good recommender? a preliminary study. arXiv preprint
arXiv:2304.10149 (2023).
[69] Junling Liu, Chao Liu, Peilin Zhou, Qichen Ye, Dading Chong, Kang Zhou, Yueqi Xie, Yuwei Cao, Shoujin Wang, Chenyu You, et al. 2023. Llmrec:
Benchmarking large language models on recommendation task. arXiv preprint arXiv:2308.12241 (2023).
[70] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic
survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
[71] Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023. A First Look at LLM-Powered Generative News Recommendation. arXiv preprint
arXiv:2305.06566 (2023).
[72] Sichun Luo, Bowei He, Haohan Zhao, Yinya Huang, Aojun Zhou, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. 2023. RecRanker:
Instruction Tuning Large Language Model as Ranker for Top-k Recommendation. arXiv:2312.16018 [cs.IR]
[73] Yucong Luo, Mingyue Cheng, Hao Zhang, Junyu Lu, Qi Liu, and Enhong Chen. 2023. Unlocking the Potential of Large Language Models for
Explainable Recommendations. arXiv:2312.15661 [cs.IR]
[74] Tianhui Ma, Yuan Cheng, Hengshu Zhu, and Hui Xiong. 2023. Large Language Models are Not Stable Recommender Systems. arXiv:2312.15746 [cs.IR]
[75] Sheshera Mysore, Andrew McCallum, and Hamed Zamani. 2023. Large Language Model Augmented Narrative Driven Recommendations. In
RecSys. ACM, 777–783.
[76] Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In
Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language
processing (EMNLP-IJCNLP). 188–197.
[77] Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start
sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3421–3430.
[78] Aleksandr V Petrov and Craig Macdonald. 2023. Generative Sequential Recommendation with GPTRec. arXiv preprint arXiv:2306.11114 (2023).
[79] Junyan Qiu, Haitao Wang, Zhaolin Hong, Yiping Yang, Qiang Liu, and Xingxing Wang. 2023. ControlRec: Bridging the Semantic Gap between
Language Model and Personalized Recommendation. arXiv:2311.16441 [cs.IR]
[80] Zhaopeng Qiu, Xian Wu, Jingyue Gao, and Wei Fan. 2021. U-BERT: Pre-training user representations for improved recommendation. In Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 35. 4320–4327.
[81] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring
the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[82] Xubin Ren, Wei Wei, Lianghao Xia, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2023. Representation Learning with Large
Language Models for Recommendation. arXiv preprint arXiv:2310.15950 (2023).
[83] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International conference on data mining. IEEE, 995–1000.
[84] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback.
In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
[85] Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, and Lucas Dixon. 2023. Large Language Models are Competitive Near Cold-start
Recommenders for Language- and Item-based Preferences. In RecSys. ACM, 890–896.
[86] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings
of the 10th international conference on World Wide Web. 285–295.
[87] Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. 2023. In chatgpt we trust? measuring and characterizing the reliability of chatgpt.
arXiv preprint arXiv:2304.08979 (2023).
[88] Tianhao Shi, Yang Zhang, Zhijian Xu, Chong Chen, Fuli Feng, Xiangnan He, and Qi Tian. 2023. Preliminary Study on Incremental Learning for
Large Language Model-based Recommender Systems. arXiv:2312.15599 [cs.IR]
[89] Yubo Shu, Hansu Gu, Peng Zhang, Haonan Zhang, Tun Lu, Dongsheng Li, and Ning Gu. 2023. RAH! RecSys-Assistant-Human: A Human-Central
Recommendation Framework with Large Language Models. arXiv preprint arXiv:2308.09904 (2023).
[90] Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence 2009 (2009).
[91] Wenqi Sun, Ruobing Xie, Junjie Zhang, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2024. Distillation is All You Need for Practically Using
Different Pre-trained Recommendation Models. arXiv:2401.00797 [cs.IR]
[92] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford
Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[93] Poonam B Thorat, Rajeshwari M Goudar, and Sunita Barve. 2015. Survey on collaborative filtering, content-based filtering and hybrid recommen-
dation system. International Journal of Computer Applications 110, 4 (2015), 31–36.
[94] Zhen Tian, Ting Bai, Zibin Zhang, Zhiyuan Xu, Kangyi Lin, Ji-Rong Wen, and Wayne Xin Zhao. 2023. Directed Acyclic Graph Factorization
Machines for CTR Prediction via Knowledge Distillation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data
Mining. 715–723.
[95] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[96] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava,
Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[97] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. Advances in neural information processing systems 30 (2017).
[98] Dui Wang, Xiangyu Hou, Xiaohui Yang, Bo Zhang, Renbing Chen, and Daiyue Xue. 2023. Multiple Key-value Strategy in Recommendation Systems
Incorporating Large Language Model. arXiv preprint arXiv:2310.16409 (2023).
[99] Lei Wang and Ee-Peng Lim. 2023. Zero-Shot Next-Item Recommendation using Large Pretrained Language Models. arXiv preprint arXiv:2304.03153
(2023).
[100] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023. A survey
on large language model based autonomous agents. arXiv preprint arXiv:2308.11432 (2023).
[101] Lei Wang, Jingsen Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, and Ji-Rong Wen. 2023. RecAgent: A Novel Simulation Paradigm
for Recommender Systems. arXiv preprint arXiv:2306.02552 (2023).
[102] Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not
fair evaluators. arXiv preprint arXiv:2305.17926 (2023).
[103] Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2023. Generative recommendation: Towards next-generation recommender
paradigm. arXiv preprint arXiv:2304.03516 (2023).
[104] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd
international ACM SIGIR conference on Research and development in Information Retrieval. 165–174.
[105] Xiaolei Wang, Xinyu Tang, Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. 2023. Rethinking the Evaluation for Conversational Recommendation in
the Era of Large Language Models. In EMNLP. Association for Computational Linguistics, 10052–10065.
[106] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-Consistency
Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
[107] Yan Wang, Zhixuan Chu, Xin Ouyang, Simeng Wang, Hongyan Hao, Yue Shen, Jinjie Gu, Siqiao Xue, James Y Zhang, Qing Cui, et al. 2023.
Enhancing recommender systems with large language model reasoning graphs. arXiv preprint arXiv:2308.10835 (2023).
[108] Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. 2023.
Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296 (2023).
[109] Yu Wang, Zhiwei Liu, Jianguo Zhang, Weiran Yao, Shelby Heinecke, and Philip S. Yu. 2023. DRDT: Dynamic Reflection with Divergent Thinking
for LLM-based Sequential Recommendation. arXiv:2312.11336 [cs.IR]
[110] Zhoumeng Wang. 2023. Empowering Few-Shot Recommender Systems with Large Language Models – Enhanced Representations.
arXiv:2312.13557 [cs.IR]
[111] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting
elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
[112] Wei Wei, Xubin Ren, Jiabin Tang, Qinyong Wang, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2023. LLMRec: Large
Language Models with Graph Augmentation for Recommendation. arXiv preprint arXiv:2311.00423 (2023).
[113] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa
Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 (2021).
[114] Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2021. Empowering news recommendation with pre-trained language models. In
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1652–1656.
[115] Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-supervised graph learning for recommendation.
In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 726–735.
[116] Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2023. A Survey
on Large Language Models for Recommendation. arXiv preprint arXiv:2305.19860 (2023).
[117] Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. 2022. Graph neural networks in recommender systems: a survey. Comput. Surveys 55, 5
(2022), 1–37.
[118] Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. 2023. A brief overview of ChatGPT: The history, status
quo and potential future development. IEEE/CAA Journal of Automatica Sinica 10, 5 (2023), 1122–1136.
[119] Yunjia Xi, Weiwen Liu, Jianghao Lin, Jieming Zhu, Bo Chen, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023. Towards Open-World
Recommendation with Knowledge Augmentation from Large Language Models. arXiv preprint arXiv:2306.10933 (2023).
[120] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv
preprint arXiv:2309.17453 (2023).
[121] Lanling Xu, Zhen Tian, Gaowei Zhang, Junjie Zhang, Lei Wang, Bowen Zheng, Yifan Li, Jiakai Tang, Zeyu Zhang, Yupeng Hou, et al. 2023. Towards
a More User-Friendly and Easy-to-Use Benchmark Library for Recommender Systems. In Proceedings of the 46th International ACM SIGIR Conference
on Research and Development in Information Retrieval. 2837–2847.
[122] Fan Yang, Zheng Chen, Ziyan Jiang, Eunah Cho, Xiaojiang Huang, and Yanbin Lu. 2023. PALR: Personalization Aware LLMs for Recommendation.
arXiv e-prints (2023), arXiv–2305.
[123] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual
Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023).
[124] Zhengyi Yang, Jiancan Wu, Yanchen Luo, Jizhi Zhang, Yancheng Yuan, An Zhang, Xiang Wang, and Xiangnan He. 2023. Large Language Model
Can Interpret Latent Space of Sequential Recommender. arXiv preprint arXiv:2310.20487 (2023).
[125] Jing Yao, Wei Xu, Jianxun Lian, Xiting Wang, Xiaoyuan Yi, and Xing Xie. 2023. Knowledge Plugins: Enhancing Large Language Models for
Domain-Specific Recommendations. arXiv preprint arXiv:2311.10779 (2023).
[126] Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to Go Next for Recommender
Systems? ID- vs. Modality-based Recommender Models Revisited. In SIGIR. ACM, 2639–2649.
[127] Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to go next for recommender systems?
id-vs. modality-based recommender models revisited. arXiv preprint arXiv:2303.13835 (2023).
[128] Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. 2023. LlamaRec: Two-Stage Recommendation using
Large Language Models for Ranking. arXiv preprint arXiv:2311.02089 (2023).
[129] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b:
An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022).
[130] An Zhang, Leheng Sheng, Yuxin Chen, Hao Li, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2023. On Generative Agents in Recommendation.
arXiv preprint arXiv:2310.10108 (2023).
[131] Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Is ChatGPT Fair for Recommendation? Evaluating Fairness
in Large Language Model Recommendation. In RecSys. ACM, 993–999.
[132] Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. AgentCF: Collaborative
Learning with Autonomous Language Agents for Recommender Systems. arXiv preprint arXiv:2310.09233 (2023).
[133] Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: A large
language model empowered recommendation approach. arXiv preprint arXiv:2305.07001 (2023).
[134] Wenxuan Zhang, Hongzhi Liu, Yingpeng Du, Chen Zhu, Yang Song, Hengshu Zhu, and Zhonghai Wu. 2023. Bridging the Information Gap Between
Domain-Specific Model and General LLM for Personalized Recommendation. arXiv preprint arXiv:2311.03778 (2023).
[135] Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang. 2021. Language models as recommender systems:
Evaluations and limitations. (2021).
[136] Yang Zhang, Fuli Feng, Jizhi Zhang, Keqin Bao, Qifan Wang, and Xiangnan He. 2023. CoLLM: Integrating Collaborative Embeddings into Large
Language Models for Recommendation. arXiv preprint arXiv:2310.19488 (2023).
[137] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2023. Multimodal chain-of-thought reasoning in language
models. arXiv preprint arXiv:2302.00923 (2023).
[138] Wayne Xin Zhao, Yupeng Hou, Xingyu Pan, Chen Yang, Zeyu Zhang, Zihan Lin, Jingsen Zhang, Shuqing Bian, Jiakai Tang, Wenqi Sun, et al.
2022. RecBole 2.0: Towards a More Up-to-Date Recommendation Library. In Proceedings of the 31st ACM International Conference on Information &
Knowledge Management. 4722–4726.
[139] Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min,
Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. 2021. RecBole: Towards a Unified,
Comprehensive and Efficient Framework for Recommendation Algorithms. In CIKM. ACM, 4653–4664.
[140] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al.
2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[141] Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Adapting Large Language Models by Integrating
Collaborative Semantics for Recommendation. arXiv preprint arXiv:2311.09049 (2023).
[142] Aakas Zhiyuli, Yanfang Chen, Xuan Zhang, and Xun Liang. 2023. BookGPT: A General Framework for Book Recommendation Empowered by
Large Language Model. arXiv preprint arXiv:2305.15673 (2023).
[143] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le,
et al. 2022. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning
Representations.
[144] Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised
learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on
information & knowledge management. 1893–1902.
[145] Yaochen Zhu, Liang Wu, Qi Guo, Liangjie Hong, and Jundong Li. 2023. Collaborative Large Language Model for Recommender Systems. arXiv
preprint arXiv:2311.01343 (2023).