UserBench: An Interactive Gym Environment for User-Centric Agents

Cheng Qian1,2, Zuxin Liu1, Akshara Prabhakar1, Zhiwei Liu1, Jianguo Zhang1,
Haolin Chen1, Heng Ji2, Weiran Yao1, Shelby Heinecke1, Silvio Savarese1,
Caiming Xiong1, Huan Wang1
1 Salesforce AI Research   2 University of Illinois Urbana-Champaign

arXiv:2507.22034v1 [[Link]] 29 Jul 2025

Abstract

Large Language Models (LLMs)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability. All code and data are publicly available to support future research.1

1 UserBench released at [Link] SalesforceAIResearch/UserBench

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning (Jaech et al., 2024; Guo et al., 2025), code generation (Hui et al., 2024; Guo et al., 2024; Zhang et al., 2024b), and solving advanced mathematical problems (Yang et al., 2024; Shao et al., 2024; Qian et al., 2025c). Their limitations in updated knowledge and precise computation are increasingly mitigated through tool use (Liu et al., 2024; Qian et al., 2024c; Zhang et al., 2024a; Prabhakar et al., 2025) and tool creation (Qian et al., 2023; Cai et al., 2024; Yuan et al., 2024; Qian et al., 2024d), enabling them to retrieve information and interact with external environments including databases, the web, and games (Deng et al., 2024; Qian et al., 2024a; Zhu et al., 2025). This tool-augmented reasoning grants LLMs agentic capabilities: they can autonomously execute tasks via these tools.

However, existing agentic environments often overlook a critical dimension: the role of the user. Despite achieving strong task performance, agents frequently fail to satisfy real user needs due to their inability to understand, adapt to, and collaborate with the task initiator (Qian et al., 2024b; Lu et al., 2025). Current evaluations primarily assess tool use and task execution, rarely considering whether the agent effectively interprets and aligns with the user's underlying and evolving intent.

This motivates our central research question: How can we evaluate agents from a user-centric perspective? To answer this, we first examine how users typically communicate goals. Human communication is inherently a joint activity, where meaning is co-constructed through interaction (Clark, 1996). Moreover, language is inherently ambiguous, making it difficult for users to fully and clearly convey their intent in a single interaction (Liu et al., 2023). As such, user instructions tend to share three core traits: (i) Underspecification: users often initiate requests before fully formulating their goals; (ii) Incrementality: intent emerges and evolves across interaction; and (iii) Indirectness: users may obscure or soften their true intent due to social or strategic reasons.

Revolving around these traits, we introduce UserBench, a user-centric environment designed to facilitate an agent's ability to engage in meaningful, multi-turn interactions with users who exhibit these traits. In UserBench, simulated users provide an initial vague task instruction (underspecification), gradually reveal preferences over time (incrementality),
and often do so implicitly (indirectness). Agents must proactively clarify goals, interpret subtle cues, and adaptively reason through tool use to succeed.

Built on the standard Gymnasium framework, UserBench offers a modular, extensible setup with a standardized interaction interface and a stable tool-use backend, enabling rigorous and reproducible evaluation. We benchmark several leading open- and closed-source models and find that current LLMs still struggle to interactively uncover and act on user preferences. For instance, scores drop by over 40% on average when models are restricted to selecting only one option per travel aspect in UserBench, revealing their difficulty in making optimal decisions. Moreover, models provide answers that fully align with all user intents only 20% of the time on average, and even the best-performing models elicit less than 30% of all user preferences through active querying, suggesting limited ability to engage in purposeful, user-driven dialogue. While strong models handle tool use reliably, they remain brittle in understanding implicit and nuanced human needs, highlighting the core challenge that UserBench aims to measure. We summarize our contributions as follows:

• Data: We propose 4K+ scenarios that capture grounded communication challenges (underspecification, incrementality, indirectness) through a carefully curated pipeline.

• Environment: We introduce UserBench, a scalable and modular gym environment designed both as a benchmark and a training ground for LLM agents engaged in multi-turn, preference-driven user interactions.

• Analysis: We show that despite strong tool use capability, current models still struggle to comprehensively uncover and align with user intent, revealing limits in user-centric reasoning.

We view this work as a foundational step toward truly user-centric agents: not just efficient executors, but collaborative teammates capable of aligning with nuanced human intent.

2 Related Work

User-centric environments for LLM evaluation. Recent work has increasingly emphasized the importance of evaluating LLMs under realistic, user-centric conditions. Traditional benchmarks assess task success assuming fully specified prompts, overlooking how users often communicate underspecified, incremental, or indirect goals. To address this, benchmarks like User Reported Scenarios (Wang et al., 2024a), Intention-in-Interaction (Qian et al., 2024b), and WildBench (Lin et al., 2024) compile real user queries to evaluate whether models align with nuanced human intent. These datasets emphasize preference satisfaction, multi-intent understanding, and performance in "in-the-wild" conditions. Complementing this, other benchmarks such as MINT (Wang et al., 2024b), PrefEval (Zhao et al., 2025), τ-Bench (Yao et al., 2024), and τ²-Bench (Barres et al., 2025) focus on dynamic, multi-turn interactions, testing whether agents can incorporate feedback, handle evolving preferences, and maintain user alignment over time. However, several limitations remain. For example, a large portion of interactions in these benchmarks are automatically synthesized, often resulting in overly lengthy and unnatural goal formulations that diverge from how users typically express themselves. In contrast, UserBench explicitly models the three core interactive traits of user communication that are critical for real-world alignment. It also spans hundreds of diverse goal configurations, enabling fine-grained evaluation on a wide range of user intents, thus providing a more comprehensive and behaviorally grounded setting for assessing user-centric agents. We provide a comprehensive comparison of our gym environment with recent environments and benchmarks in Table 1.

User-centric agent designs. Designing agents that collaborate effectively with users requires modeling ambiguity, evolving intent, and user-specific preferences. Standard instruction-tuned models often hallucinate intent or avoid asking clarifying questions. Recent work aims to teach agents to proactively clarify underspecified instructions. For instance, models trained with simulated ambiguity resolution (Zhang et al., 2024c; Chen et al., 2025) better recognize when to ask versus answer. Other work tackles user modeling directly: TravelPlanner+ (Singh et al., 2024) and PRELUDE (Gao et al., 2024) build agents that personalize responses using either explicit profiles or latent preferences inferred from user edits. While these designs make progress toward adaptive, user-aware agents, they often target narrow personalization tasks or rely on static user models; in contrast, our work provides a general-purpose evaluation environment that systematically tests an agent's ability to dynamically uncover and respond to emergent user intent across varied and customizable interaction scenarios.
Benchmark | Multi-turn Interaction | Goal Ambiguity | Tool Use | Dynamic State | Multi-Aspect Reasoning | User Simulation | Domain Diversity | Customizable | Scalable
τ²-Bench (Barres et al., 2025) | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
τ-Bench (Yao et al., 2024) | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
ToolSandbox (Lu and et al., 2024) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
MINT (Wang et al., 2024b) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
IN3 (Qian et al., 2024b) | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗
ToolTalk (Farn and Shin, 2023) | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗
API-Bank (Li et al., 2023) | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓
MetaTool (Huang et al., 2023) | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓
UserBench (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

Table 1: For each existing benchmark, the table indicates whether the corresponding trait is fully addressed (✓), partially addressed (✓), or not addressed (✗). We provide a detailed explanation for each trait being compared in Appendix A.

Figure 1: The pipeline of UserBench Construction, including data gathering, preference curation, tool augmentation,
environment setup and interface design. Random Preference Combinations makes our pipeline highly scalable and
enables the efficient generation of even 10K+ data points.

3 UserBench

We build UserBench upon the standardized Gymnasium environment, focusing on travel planning tasks where users gradually reveal their preferences in implicit ways. Compared to prior work, UserBench features a significantly more diverse and grounded collection of user preferences, each paired with carefully curated implicit expressions, and supports tool-augmented search with injected noise and cost-awareness. This setting challenges agents to perform search and reasoning in a user-centric manner, under constraints of partial, evolving, and indirect preference information. We describe the gym design details in the following.

3.1 Data Gathering

As shown in Figure 1, we begin by identifying five core aspects of travel planning: flight, hotel, apartment, car rental, and restaurant. Our data gathering is guided by three key principles: realism, diversity, and implicitness. First, we ensure realism by curating preferences that reflect plausible, real-world user needs, such as preferring direct flights (flight) or king-sized beds (hotel). Second, we aim for diversity by collecting around a hundred distinct preferences that span travel aspects. Third, we emphasize implicitness by pairing each preference with multiple naturally phrased statements that express the intent indirectly. For instance, the preference for direct flights might be expressed as: "I always keep my schedule packed tight, so I prefer travel routes that minimize transit time." These implicit expressions are carefully curated and validated to ensure that (i) they do not reveal the preference explicitly, and (ii) they still clearly imply the intended preference without ambiguity.

These curated preferences and their associated implicit expressions serve as the foundation for generating travel scenarios. We randomly sample and combine preferences across the five aspects, categorizing the resulting data into three difficulty tiers based on the number of preferences involved. For each combination, we generate an ideal travel arrangement statement that incorporates all relevant preferences using GPT-4o. This finally forms a data point in our environment.
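To make the Random Preference Combination step concrete, the sketch below shows one way such a data point could be assembled. It is a minimal illustration under assumed field names and an assumed preference pool (the "direct flight" expression is quoted from the text above; the other preferences and expressions are invented for the example), not the released construction pipeline.

import random

# Assumed structure: each travel aspect maps to curated preferences, and each
# preference carries one or more human-written implicit expressions.
PREFERENCE_POOL = {
    "flight": {
        "direct flight": ["I always keep my schedule packed tight, so I prefer "
                          "travel routes that minimize transit time."],
        "extra legroom": ["Long flights leave my knees aching if the seat is tight."],
    },
    "hotel": {
        "king-sized bed": ["I sleep terribly when the bed feels cramped."],
        "late checkout": ["My mornings tend to run long before I can pack up."],
    },
    # ... apartment, car rental, and restaurant entries would follow the same shape
}

def build_scenario(aspects, prefs_per_aspect, seed=0):
    """Randomly combine implicit preferences across aspects into one scenario."""
    rng = random.Random(seed)
    scenario = {}
    for aspect, k in zip(aspects, prefs_per_aspect):
        pool = PREFERENCE_POOL[aspect]
        chosen = rng.sample(sorted(pool), k=min(k, len(pool)))
        scenario[aspect] = [
            {"preference": name, "implicit_expression": rng.choice(pool[name])}
            for name in chosen
        ]
    return scenario

# A "Travel-22"-style data point: two aspects with two preferences each.
print(build_scenario(["flight", "hotel"], [2, 2]))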
3.2 Tool Augmentation

For each travel planning aspect, we develop a corresponding tool that simulates a database search. For example, the flight search tool accepts origin, destination, and date, returning a list of mock options. Instead of querying real-time data, we rely on pre-generated options to ensure stable and controlled tool outputs. This design choice is justified because (i) it guarantees consistency and quality of returned results, and (ii) our focus is user-centric reasoning rather than the challenges of online search.

Each data point is associated with a tailored database of options for all relevant travel planning aspects. Specifically, for each aspect, we generate three types of options: correct (satisfying all preferences), wrong (violating at least one preference), and noise (incomplete or irrelevant to the query, such as a flight option with a destination different from the one the user searched for). These options are mixed to form the complete search space, with each data point containing up to over 100 options. Option generation is handled by GPT-4o under human supervision to ensure quality and diversity.

3.3 Environment Building

Each data point in the environment contains a travel scenario, a set of implicit user preferences, and a database of relevant options. The environment simulates an oracle user with access to these internal states, while engaging in multi-turn interactions with an agent model under test. Upon reset, the environment provides only basic trip information and high-level requirements (e.g., hotel or flight), without disclosing any of its specific preferences. Preferences are then elicited over time, either (i) when the tested model explicitly asks about a relevant aspect, or (ii) after a fixed number of turns without progress (see Appendix B for details), at which point a preference is randomly and proactively revealed. All elicited preferences are presented implicitly, in line with the associated data. This design captures the three key traits and challenges we previously identified: underspecification, where only coarse needs are initially provided; incrementality, where user preferences emerge progressively as the conversation evolves; and indirectness, where preferences are stated implicitly through natural, carefully curated, heuristic-based language. These properties reflect real-world user behaviors and make UserBench a robust testbed for user-centric reasoning.

3.4 Agent Interaction

We provide a standardized interaction interface for agents, supporting three types of actions: action, search, and answer. Through search, agents issue queries specifying the travel planning aspect and arguments (e.g., hotel with specific dates), which are matched against ground truth. If the arguments match, the environment returns the mixed set of relevant options; otherwise, it returns nothing. Through action, the agent communicates with the user, typically by asking clarifying questions. The environment interprets the intent and responds naturally, potentially revealing a preference in an implicit form. Finally, through answer, the agent selects its recommended option ID(s), which are evaluated against the ground-truth correct options. Since ideal itineraries often span multiple aspects, correct answers may include several options.
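The following is a hedged sketch of what one exchange under this three-action interface could look like. The choices (search, action, answer) and the idea of option IDs come from the section above; the exact message format, argument names, and IDs other than F14 (taken from the appendix example) are assumptions for illustration, not the environment's actual wire format.

# Illustrative only: field names and most IDs are hypothetical.
agent_turns = [
    # 1) Clarify intent with the simulated user.
    {"choice": "action",
     "content": "Do you have any seating or routing preferences for the flight?"},
    # 2) Query the mock flight database once arguments are known.
    {"choice": "search",
     "content": {"aspect": "flight", "origin": "New York",
                 "destination": "San Francisco", "date": "2025-08-01"}},
    # 3) Commit to recommended option IDs, one or more per aspect.
    {"choice": "answer",
     "content": {"flight": ["F14"], "hotel": ["H3"]}},
]

for turn in agent_turns:
    print(turn["choice"], "->", turn["content"])

In the environment, a search whose arguments do not match the ground truth returns nothing, and only the answer step is scored against the correct options.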
3.5 Variants, Scalability and Statistics

UserBench supports several extensions. First, it can simulate a noisy search environment where query results are corrupted at a configurable frequency, mimicking real-world web or tool errors. Second, it allows for budget-aware decision-making by rewarding models that not only find correct options but also select the most cost-effective ones (the best option). A hyperparameter can control the strength of this reward, encouraging agents to optimize toward practical utility.

As illustrated in Figure 1, the Random Preference Combination strategy enhances both the diversity and scalability of our data construction, enabling over 10K distinct scenarios. To manage computational costs during benchmarking, we use 417 data points for testing and reserve 2651 for training, which are not used in benchmarking but may support future model training (see the Discussion section). Detailed statistics of UserBench are presented in Table 2. For instance, "Travel-223" denotes a scenario with three travel aspects: two containing two implicit preferences each, and one with three. Additional details on data and environment construction are provided in Appendix B.
Difficulty tiers (Difficulty Tier | # Test | # Train | Composition):
Easy | 118 | 666 | Travel-22, 2222
Medium | 201 | 1130 | Travel-33, 233, 333
Hard | 152 | 855 | Travel-44, 334, 444
Total (UserBench Test / Train) | 471 | 2651

Global dataset metadata:
Travel Aspects | 5
Pref. Categories | 24
Total Prefs. | 82
Elicitation Ways | 902
Options (B/C/W/N) | 1/3/10/5

Travel aspect coverage (Travel Aspect | # Scenarios | Preferences):
Hotel | 187 (39.7%) | 15
Restaurant | 298 (63.3%) | 19
Flight | 275 (58.4%) | 16
Rental Car | 243 (51.6%) | 14
Apartment | 172 (36.5%) | 18

Table 2: UserBench Statistics: difficulty-tiered statistics (left), global configuration metadata (middle), and travel aspect coverage (right). Pref. denotes preference; B/C/W/N stands for Best, Correct, Wrong, and Noise options respectively; # Scenarios indicates the number of data scenarios in which each travel aspect appears.

Model Name | Best Exist Rate | Correct Exist Rate | Score | Search Attempt Valid (%) | Action Attempt Valid (%) | Preference Elicited (%) (Active / Passive)
GPT-4o | 0.204 | 0.361 | 0.329 | 82.48 | 27.82 | 27.32 (24.06 / 3.26)
Gemini-2.5-Pro | 0.245 | 0.328 | 0.317 | 77.40 | 29.26 | 29.71 (23.85 / 5.85)
Claude-4-Sonnet | 0.260 | 0.318 | 0.307 | 74.28 | 24.26 | 34.25 (26.31 / 7.94)
Deepseek-V3 | 0.148 | 0.218 | 0.210 | 61.22 | 23.62 | 31.82 (18.90 / 12.92)
Qwen3-14B | 0.157 | 0.197 | 0.209 | 70.56 | 33.71 | 30.07 (15.95 / 14.12)
Qwen3-32B | 0.154 | 0.211 | 0.206 | 79.42 | 26.89 | 22.01 (13.04 / 8.96)
Llama-3.3-70B | 0.120 | 0.209 | 0.198 | 60.51 | 11.29 | 23.25 (13.05 / 10.20)
Qwen3-8B | 0.140 | 0.186 | 0.180 | 68.94 | 23.20 | 28.86 (11.15 / 17.70)
GPT-4o-mini | 0.102 | 0.173 | 0.166 | 76.10 | 13.55 | 25.43 (12.91 / 12.52)
Llama-3.1-8B | 0.073 | 0.171 | 0.159 | 57.99 | 12.13 | 19.87 (8.09 / 11.78)
Gemini-2.5-Flash | 0.047 | 0.080 | 0.125 | 83.62 | 28.14 | 14.00 (11.15 / 2.86)

Table 3: UserBench main evaluation results across different models (single-choice setting).

Model Name | Best Exist Rate | Correct Exist Rate | Score | Search Attempt Valid (%) | Action Attempt Valid (%) | Preference Elicited (%) (Active / Passive)
GPT-4o | 0.652 | 0.725 | 0.710 | 85.96 | 37.89 | 15.10 (13.11 / 1.99)
Gemini-2.5-Pro | 0.604 | 0.685 | 0.673 | 80.43 | 38.01 | 37.25 (28.80 / 8.46)
Claude-4-Sonnet | 0.586 | 0.619 | 0.612 | 87.94 | 41.17 | 42.21 (30.67 / 11.54)
Qwen3-32B | 0.343 | 0.416 | 0.411 | 82.76 | 34.24 | 17.72 (10.81 / 6.90)
Deepseek-V3 | 0.351 | 0.398 | 0.391 | 61.91 | 21.06 | 16.96 (10.18 / 6.78)
Llama-3.3-70B | 0.294 | 0.366 | 0.372 | 63.46 | 21.72 | 31.72 (16.68 / 15.04)
Gemini-2.5-Flash | 0.279 | 0.316 | 0.311 | 84.27 | 33.58 | 15.32 (11.67 / 3.64)
GPT-4o-mini | 0.227 | 0.294 | 0.283 | 81.54 | 24.90 | 20.18 (10.46 / 9.72)
Qwen3-14B | 0.204 | 0.249 | 0.258 | 71.48 | 35.06 | 26.21 (13.07 / 13.14)
Llama-3.1-8B | 0.117 | 0.252 | 0.234 | 61.38 | 12.20 | 26.01 (7.80 / 18.20)
Qwen3-8B | 0.147 | 0.186 | 0.182 | 69.83 | 23.26 | 20.31 (7.95 / 12.37)

Table 4: UserBench main evaluation results across different models (multi-choice setting).

4 Experiments

4.1 Settings

Gym Settings. We adopt both variants (corrupted search, budget constraint) to increase test difficulty. GPT-4o with a temperature of 0.0 serves as the user simulator for all conversations. In the standard setup, we limit the maximum number of conversation turns to 20. We evaluate models under two settings: (1) a single-choice setting (our main evaluation), where the model is allowed to output only one option for each travel aspect, and (2) a multi-choice setting, where the model may output multiple options, and we evaluate it based on the option that achieves the highest reward.

Models. Our evaluation includes both closed-source and open-source models. Closed-source models include the GPT, Claude, Deepseek, and Gemini families, while open-source models include the Qwen3 and Llama3 families, with model sizes ranging from 8B to 70B. All models generate responses using a temperature of 0.0 to ensure deterministic behavior.

Metrics. The main evaluation metric is a normalized score based on the quality of selected options for each aspect of a travel scenario. For each aspect, if the model selects the best option (as judged by reward), it receives a score of 1.0. If the option is correct but not the best, it receives 0.8. All other options receive 0.0. We compute the highest reward score among the options selected for each aspect (in the multi-choice setting) and average this over all aspects in the scenario. For example, if a scenario has two aspects, and the model selects options with rewards 1.0 and 0.8 for the first aspect, and 0.0, 0.8, and 0.0 for the second, the score is:

Final Score = [max(1.0, 0.8) + max(0.0, 0.8, 0.0)] / 2 = 0.9

In the single-choice setting, only the first option per aspect is considered, so the max function does not apply and the score is averaged only over the first choice of each aspect.
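As a concrete restatement of this scoring rule, the snippet below computes the worked example above. It is a minimal sketch of the formula as described here (1.0 for the best option, 0.8 for a correct but non-best option, 0.0 otherwise, maximum per aspect, then a mean over aspects), not the benchmark's actual evaluation code.

def scenario_score(rewards_per_aspect):
    """rewards_per_aspect: one inner list of option rewards per travel aspect.

    Multi-choice setting: take the highest reward among the options selected
    for each aspect, then average across aspects. In the single-choice setting
    each inner list would contain a single reward.
    """
    best_per_aspect = [max(rewards) for rewards in rewards_per_aspect]
    return sum(best_per_aspect) / len(best_per_aspect)

# The worked example from the text: two aspects, multi-choice selections.
print(scenario_score([[1.0, 0.8], [0.0, 0.8, 0.0]]))  # -> 0.9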
In addition, we also report several auxiliary metrics to better understand model behavior (all micro-averaged):

• Best Exist Rate: The proportion of aspects where the model includes the best option among its selected choices.

• Correct Exist Rate: The proportion of aspects where the model includes one correct (not necessarily best) option.

• Valid Search Attempt (%): The rate at which the model's search queries are syntactically valid.

• Valid Action Attempt (%): The rate at which the model's actions successfully probe the user's real preferences.

• Preference Elicited (%): The percentage of all ground-truth preferences revealed during the conversation. It includes Active elicitation, where preferences are disclosed in response to valid action attempts by the tested model, and Passive elicitation, where preferences are released by UserBench to guide the conversation when the model becomes too off-topic.

Please refer to Appendix C for a more detailed explanation of each metric.

Figure 2: The score distribution of tested models across three difficulty tiers.

4.2 Results

We present the results for the two settings in Table 3 and Table 4, and summarize our key findings below.

Single-choice setting is significantly more challenging. On average, scores drop by approximately 40% when switching from the multi-choice to the single-choice setting. This highlights the difficulty models face in selecting the best, or even one of the three correct, options with only one answer attempt. When allowed to propose multiple answers, we observe that performance generally improves, indicating that having more chances increases the likelihood of hitting a correct answer.

Preference elicitation remains low across models. One might expect that performance gains in the multi-choice setting stem from better user understanding. However, we find that preference elicitation rates do not significantly improve, and in some cases, such as GPT-4o and Deepseek-V3, even decline. This suggests that higher scores more often result from random guessing or brute-force coverage rather than active reasoning. Moreover, the overall user preference elicitation rate remains low across models, especially for preferences revealed through active queries. This indicates that current models still struggle to proactively and effectively uncover user needs in interactive settings.

Understanding users is harder than executing tool use. Top-performing models maintain over 80% success in valid search attempts, but the rate of valid action attempts is much lower. This aligns with UserBench's goal of shifting difficulty from tool use to user understanding. Note that the valid action attempt rate is computed only over action steps (excluding search and answer), so a higher rate does not necessarily reflect a higher absolute number of valid queries overall. Still, it indicates that the model's questions are more precise and preference-relevant, rather than vague or off-topic, which we classify as invalid attempts.

Other Interesting Findings. Some models, such as Gemini-2.5-Flash, achieve high Action Attempt Valid rates but relatively low Active Preference Elicited. This suggests they can ask effective clarifying questions but often fail to do so repeatedly or comprehensively, limiting their ability to capture the full scope of user preferences. Please see Appendix C for detailed metric explanations that help in understanding this result. In contrast, Claude-4-Sonnet performs well on both metrics, indicating strong capability in intent clarification. However, its overall score is not the highest. This highlights a gap between understanding user preferences and effectively integrating them into decision-making. In other words, even when models excel at eliciting preferences, they may still struggle to leverage that information in their reasoning to deliver optimal recommendations.

5 Analysis

All analyses are conducted under the single-choice setting, which serves as our main evaluation setup, and focus on the score as the primary metric. The turn-based, pass-k sampling, and choice ablation analyses are performed specifically on data points from the Travel-22, 33, and 44 scenarios.
Model Name | First Index ↓ | Weighted Score ↑ | Correct Exist Rate ↑
GPT-4o | 3.846 | 0.054 | 0.361
Gemini-2.5-Pro | 4.476 | 0.044 | 0.328
Qwen3-8B | 2.005 | 0.039 | 0.186
Qwen3-32B | 2.739 | 0.039 | 0.211
Claude-4-Sonnet | 5.018 | 0.038 | 0.318
Deepseek-V3 | 2.740 | 0.037 | 0.218
Qwen3-14B | 2.650 | 0.035 | 0.197
Llama-3.3-70B | 2.776 | 0.032 | 0.209
Llama-3.1-8B | 1.884 | 0.028 | 0.171
GPT-4o-mini | 2.185 | 0.028 | 0.173
Gemini-2.5-Flash | 0.927 | 0.014 | 0.080

Table 5: Weighted Timing Analysis: models evaluated by the timing of their highest scores and by coverage. The weight function w(i) = 1/(i + 1) penalizes later discoveries of correct or best user-aligned answer options. Coverage is the micro-average of aspects with valid answers.

Figure 3: When the number of aspects is fixed, more user preferences generally lead to lower scores.

Figure 4: When the total number of user preferences is fixed, fewer preferences per aspect generally lead to higher scores.

Difficulty tier division effectively reflects UserBench's challenges. We begin by analyzing how scenario difficulty impacts model performance, as shown in Figure 2. Following the tiering in Table 2, we divide test scenarios into Easy, Medium, and Hard based on preference complexity. Results show a general downward trend in scores as difficulty increases. This confirms that our difficulty stratification captures real reasoning challenges for models. Furthermore, the performance drop within each model across tiers reveals a lack of robustness in managing complex user interactions, which is something humans typically excel at.

The key challenge lies in the number of preferences per aspect. To identify the core source of difficulty in UserBench, we investigate whether performance is more affected by the number of travel aspects or the number of preferences per aspect. We first fix the number of aspects and vary the number of preferences per aspect. As shown in Figure 3, model scores consistently decline as the number of preferences increases. This suggests that handling richer user preference signals is a major challenge for current models.

Next, we fix the total number of preferences in a scenario and vary how they are distributed across aspects. As shown in Figure 4, performance improves when preferences are more evenly spread across multiple aspects, rather than concentrated within a few. This implies that models reason more effectively when each aspect involves fewer, simpler preferences. Concentrating multiple preferences into a single aspect appears to overload the model's local reasoning process. Together, these results suggest that the number of preferences per aspect is the main driver of difficulty in UserBench.

Models struggle to provide answers that are both correct and timely. To assess not only whether models find correct or best answers, but also when they do so, we conduct a weighted timing analysis in Table 5. Using a weight function w(i) = 1/(i + 1), we penalize delayed discoveries, where i is the turn at which a valid answer (reward > 0) first appears. Notably, open-source Qwen models outperform Deepseek in timing, despite Deepseek ranking higher in overall accuracy in Table 3. This suggests that while Deepseek eventually finds good answers, it is less efficient in doing so. In contrast, GPT-4o and Gemini-2.5-Pro balance both coverage and timing better, achieving higher Correct Exist Rates and also higher Weighted Scores overall.

We also report the average turn index at which a valid answer first appears. Smaller models, like Llama-3.1-8B and Gemini-2.5-Flash, show lower average indices, indicating that their successful attempts tend to come from early guesses. However, these guesses often fail, leading to lower Weighted Scores and limited Correct Exist Rates. This reveals a common failure mode: smaller models rely on shallow heuristics or early guesses rather than deeply engaging with user intent. While this may occasionally yield fast correct answers, it more often results in low-quality answers not aligned with user needs. These findings also help explain the low Preference Elicited in Table 3 and Table 4. Ultimately, a delicate trade-off emerges: models are expected to be efficient in interaction, yet they should still invest enough turns to truly understand the user's evolving needs or preferences.
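To illustrate the discounting idea, the snippet below applies w(i) = 1/(i + 1) to the reward of the first valid answer and averages over scenarios. The weight function is taken from the text; the exact aggregation behind the Weighted Score column is not spelled out here, so the averaging below is an assumption and should be read as one plausible interpretation rather than the paper's evaluation code.

def timing_weight(i):
    """Discount for a valid answer first appearing at turn i (from the text)."""
    return 1.0 / (i + 1)

def weighted_score(first_valid_turns_and_rewards):
    """Assumed aggregation: mean of w(i) * reward over scenarios with a valid answer."""
    weighted = [timing_weight(i) * r for i, r in first_valid_turns_and_rewards]
    return sum(weighted) / len(weighted) if weighted else 0.0

# Hypothetical example: three scenarios whose first valid answers (reward > 0)
# appear at turns 2, 4, and 9 with rewards 1.0, 0.8, and 0.8.
print(weighted_score([(2, 1.0), (4, 0.8), (9, 0.8)]))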
Model Name | Original (w10n5) | w10n0 | w5n5 | w8n2
GPT-4o | 0.340 | 0.344 ↑0.004 | 0.360 ↑0.020 | 0.350 ↑0.010
Deepseek-V3 | 0.223 | 0.271 ↑0.048 | 0.249 ↑0.027 | 0.241 ↑0.018
Qwen3-32B | 0.210 | 0.220 ↑0.010 | 0.241 ↑0.031 | 0.224 ↑0.014
Qwen3-14B | 0.214 | 0.225 ↑0.011 | 0.199 ↓0.015 | 0.203 ↓0.011

Table 6: Impact of choice distribution on model scores. "w10n5" indicates a setting where, in addition to the correct and best options, there are 10 wrong and 5 noise options as distractors. Each cell reports the model's score and its change relative to the original setup.

Figure 5: Increasing the number of turns allowed in interaction does not necessarily lead to better performance across models.

Figure 6: Increasing the number of sampling times raises the maximum score, but the average score shows little change or even slightly drops.

More interaction turns do not guarantee better performance. We also examine whether increasing the number of interaction turns improves performance, as illustrated in Figure 5. Surprisingly, simply allowing more turns does not lead to consistent gains, and in some cases, performance even degrades. This suggests that many models fail to leverage the extended interaction window to elicit preferences or refine understanding. Instead, longer conversations often result in repetitive or off-topic dialogue. Without strong dialogue planning and goal-tracking, more turns alone are insufficient to improve performance.

Effect of sampling frequency reveals instability. We analyze the effect of sampling frequency using pass-k evaluation, as shown in Figure 6. While the maximum score across k samples steadily increases, the average score remains flat or even declines. This suggests that more samples increase the chance of stumbling upon a good response, but most outputs remain poor. The results highlight a key limitation: current models rely heavily on sampling luck rather than robust reasoning to align with user preferences. Given the high cost of pass-k sampling, this instability poses a challenge for deploying reliable user-interactive systems.

Fewer options reduce interference but do not reduce the core challenge. We finally perform an ablation study to assess how the number of wrong and noise options affects model performance. As shown in Table 6, reducing these distractors generally improves scores. However, even for strong models like Deepseek-V3 and GPT-4o, removing five wrong or noise options (about 30% of the total) yields only modest gains. Interestingly, some smaller models, such as Qwen-14B, even show slight performance drops. These results suggest that (1) reducing options does not fundamentally lower task difficulty: fully understanding user preferences remains essential, as only one best option satisfies all constraints (budget constraint included); and (2) many models still rely on shallow guessing rather than genuine preference understanding, which limits their ability to benefit from reduced distractors. These results also indirectly highlight the intentional complexity designed into our option sets for each data scenario.

6 Discussions

Broad applicability of UserBench. UserBench is implemented as a standard Gym environment, offering several advantages: (1) it exposes a familiar API (e.g., reset, step), allowing seamless integration with other Gym-compatible systems; and (2) it abstracts user behavior as a black-box environment, enabling agents to interact directly without requiring internal knowledge of user logic. This setup mirrors real-world scenarios where agents must infer user intent without full transparency, reducing the model's cognitive burden and encouraging generalizable behavior.
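A hedged sketch of driving the environment through this Gym-style API is shown below. The reset/step interface and the three choices come from the paper; the env and agent objects, the observation contents, and the action dictionary format are assumptions for illustration, not the released interface.

# Illustrative pseudo-usage of a Gymnasium-style UserBench environment.
def run_episode(env, agent, max_turns=20):
    """Roll out one scenario: the agent clarifies, searches, then answers."""
    observation, info = env.reset()   # basic trip info; no preferences revealed yet
    total_reward = 0.0
    for _ in range(max_turns):
        # The agent returns one of the three choices with its content,
        # e.g. {"choice": "action", "content": "Any seating preferences?"}
        action = agent.act(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward        # partial credit for valid search/action steps,
                                      # final credit for the chosen answer options
        if terminated or truncated:
            break
    return total_reward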
These traits also make UserBench friendly for both evaluation and training: (1) As a benchmark, it enables flexible testing configurations, including customizable option sets, reward functions (e.g., answer correctness, search and action validity, penalties, etc.), and user feedback mechanisms (e.g., rule-based, user-simulated, etc.). This allows for fine-grained agent behavioral analysis under varied user profiles. (2) As a training environment, UserBench can be used by any model with tool use capabilities via the standardized interaction interface we provide. It supports both supervised fine-tuning and reinforcement learning, especially multi-turn RL, by providing turn-wise rewards and partial credit signals, which is critical for improving robustness in user-aligned interaction. This also makes UserBench particularly well-suited to the emerging trend of applying RL to agentic LLMs (Jin et al., 2025; Li et al., 2025; Qian et al., 2025a).

Balancing efficiency and effectiveness in user interaction. Our weighted timing analysis reveals that many models, including GPT-4o, often prioritize efficiency by guessing answers early rather than thoroughly probing user preferences. As a result, while agents may appear helpful by providing quick responses, they frequently overlook deeper user intent. This mirrors real-world experiences where users receive seemingly helpful answers but still feel misunderstood, leading to repeated prompt revisions and dissatisfaction.

Conversely, we rarely observed the opposite failure mode, where models hesitate indefinitely and over-request information. Yet both extremes underscore the need to balance efficiency (responding promptly) and effectiveness (satisfying user needs). This trade-off also reflects a broader concern around tool use efficiency, which has received increasing attention in recent works (Qian et al., 2025b; Wang et al., 2025). While UserBench does not explicitly quantify this balance, it surfaces model behaviors that targeted training can effectively address. Future work, especially using RL, can shape agent behavior toward this balance. To promote efficiency, reward functions can penalize late preference discovery by decaying rewards based on the turn when the correct answer is given. To promote effectiveness, partial rewards can be granted for each user preference successfully elicited, and penalties applied to correct answers that lack sufficient supporting interaction. These strategies discourage guessing and encourage deliberate, user-aware interactions. UserBench supports such reward customizations, providing a flexible platform for training agents that are not only capable, but also user-aligned.

7 Conclusion

We introduce UserBench, a user-centric environment designed to evaluate and facilitate agents' ability to understand, interact with, and adapt to real-world user communication, which often involves underspecification, incrementality, and indirectness. By simulating realistic preference expression in a controlled travel planning environment, UserBench exposes key limitations of current LLM agents: while they are proficient in tool use, they struggle to uncover evolving user intent and respond effectively to implicit signals. Our findings highlight a critical gap between technical execution and communicative intelligence in LLMs. Looking forward, we envision a new generation of agents that move beyond task completion: agents that actively collaborate, interpret subtle cues, and align with users through meaningful, adaptive interactions. We see UserBench as a foundational step toward this vision, and we hope it inspires future work on building agents that are not only capable but also truly cooperative.
References

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. 2025. tau2-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982.

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2024. Large language models as tool makers. In The Twelfth International Conference on Learning Representations.

Maximillian Chen, Ruoxi Sun, Tomas Pfister, and Sercan O Arik. 2025. Learning to clarify: Multi-turn conversations with action-based contrastive self-training. In Proceedings of ICLR.

H.H. Clark. 1996. Using Language. ACLS Humanities E-Book. Cambridge University Press.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2024. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36.

Nicholas Farn and Richard Shin. 2023. Tooltalk: Evaluating tool-usage in a conversational setting. arXiv preprint arXiv:2311.10775.

Ge Gao, Alexey Taymanov, Eduardo Salinas, Paul Mineiro, and Dipendra Misra. 2024. Aligning llm agents by learning latent preference from user edits. In Advances in Neural Information Processing Systems (NeurIPS).

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. Deepseek-coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196.

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. 2023. Metatool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186.

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720.

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. Api-bank: A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116.

Xuefeng Li, Haoyang Zou, and Pengfei Liu. 2025. Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383.

Bill Yuchen Lin, Yuntian Deng, Khyathi R Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. 2024. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770.

Alisa Liu, Zhaofeng Wu, Julian Michael, Alane Suhr, Peter West, Alexander Koller, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. 2023. We're afraid language models aren't modeling ambiguity. In The 2023 Conference on Empirical Methods in Natural Language Processing.

Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh RN, et al. 2024. Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets. Advances in Neural Information Processing Systems, 37:54463–54482.

Jiarui Lu and et al. 2024. Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities. arXiv preprint arXiv:2408.04682.

Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, et al. 2025. Proactive agent: Shifting llm agents from reactive responses to active assistance. In The Thirteenth International Conference on Learning Representations.

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, et al. 2025. Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601.

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. 2025a. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958.

Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. 2025b. Smart: Self-aware agent for tool overuse mitigation. arXiv preprint arXiv:2502.11435.

Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, and Heng Ji. 2025c. Modelingagent: Bridging llms and mathematical modeling for real-world challenges. arXiv preprint arXiv:2505.15068.

Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. 2023. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939.

Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, et al. 2024a. Escapebench: Pushing language models to think outside the box. arXiv preprint arXiv:2412.13549.

Cheng Qian, Bingxiang He, Zhong Zhuang, Jia Deng, Yujia Qin, Xin Cong, Zhong Zhang, Jie Zhou, Yankai Lin, Zhiyuan Liu, et al. 2024b. Tell me more! towards implicit user intention understanding of language model driven agents. arXiv preprint arXiv:2402.09205.

Cheng Qian, Shihao Liang, Yujia Qin, Yining Ye, Xin Cong, Yankai Lin, Yesai Wu, Zhiyuan Liu, and Maosong Sun. 2024c. Investigate-consolidate-exploit: A general strategy for inter-task agent self-evolution. arXiv preprint arXiv:2401.13996.

Cheng Qian, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2024d. Toolink: Linking toolkit creation and using through chain-of-solving on open-source model. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 831–854.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

Harmanpreet Singh, Nikhil Verma, Yixiao Wang, Manasa Bharadwaj, Homa Fashandi, Kevin Ferreira, and Chul Lee. 2024. Personal large language model agents: A case study on tailored travel planning. In Proceedings of EMNLP (Industry Track).

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. 2025. Acting less is reasoning more! teaching model to act efficiently. arXiv preprint arXiv:2504.14870.

Jiayin Wang, Fengran Mo, Weizhi Ma, Peijie Sun, Min Zhang, and Jian-Yun Nie. 2024a. A user-centric multi-intent benchmark for evaluating large language models. In Proceedings of EMNLP.

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. 2024b. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. In Proceedings of ICLR.

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. 2024. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045.

Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, and Heng Ji. 2024. Craft: Customizing llms by creating and retrieving from specialized toolsets. In Proc. The Twelfth International Conference on Learning Representations (ICLR 2024).

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024a. xlam: A family of large action models to empower ai agent systems. arXiv preprint arXiv:2409.03215.

Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, et al. 2024b. Diversity empowers intelligence: Integrating expertise of software engineering agents. arXiv preprint arXiv:2408.07060.

Michael JQ Zhang, WB Knox, and Eunsol Choi. 2024c. Modeling future conversation turns to teach llms to ask clarifying questions. arXiv preprint arXiv:2410.13788.

Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, and Kaixiang Lin. 2025. Do llms recognize your preferences? evaluating personalized preference following in llms. In Proceedings of ICLR.

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, et al. 2025. Multiagentbench: Evaluating the collaboration and competition of llm agents. arXiv preprint arXiv:2503.01935.
Appendix

A Comparison Traits Details

We identify nine core traits that characterize recent benchmarks. These traits reflect the agent's ability to engage in realistic, multi-step, user-centered interactions, and the infrastructure's capacity to support scalable, extensible evaluation.

• Multi-turn Interaction: The benchmark requires the agent to conduct extended conversations over multiple turns, often involving behaviors including clarification, refinement, or feedback-based improvement.

• Goal Ambiguity: The tasks are initially underspecified, vague, or indirect, requiring the agent to ask clarifying questions, infer missing information, or resolve ambiguity over time.

• Tool Use: The agent must interact with external tools such as APIs, structured search engines, databases, or execution environments to complete tasks beyond its internal reasoning.

• Dynamic State: The benchmark maintains an evolving internal state (e.g., environment database, user preference profile) that is updated based on the agent or user's actions across time.

• Multi-Aspect Reasoning: Tasks consist of multiple interdependent subgoals (e.g., booking flights, hotels, and cars) that require coordinated reasoning and decision-making across aspects.

• User Simulation: The environment includes a simulated user that responds to agent queries or actions, enabling interactive learning and evaluation without requiring human annotators.

• Domain Diversity: The benchmark spans multiple domains, various tools, and different task structures, ensuring generalization beyond a fixed task type or narrow scenario.

• Customizable: The benchmark framework allows researchers to adjust task settings, rewards, user behavior, or interaction protocols without redesigning from scratch.

• Scalable: The benchmark supports large-scale, automated generation of tasks or environments, enabling reproducible experiments and stress-testing of agent capabilities.

B Gym Construction Details

Wrong Option Example: Flight Option F2
Path: New York → Los Angeles → San Francisco
Time (hours): 5 (Leg 1), 2 (Layover), 1 (Leg 2)
Airlines: United Airlines, United Airlines
Flight Numbers: UA789, UA321
Total Cost: $400
Amenities: WiFi, Meal Service, Lounge Access
Service Costs:
  Checked Bag Total: $50
  Business Class Upgrade: $250
Option Type: not-suitable
Reason: This option includes a layover, which does not satisfy the preference for a direct flight.

Correct Option Example: Flight Option F14
Path: New York → San Francisco
Time (hours): 6 (Direct)
Airlines: Delta Airlines
Flight Numbers: DL4567
Total Cost: $350
Amenities: WiFi, Meal Service, Carry on Baggage Allowance
Service Costs:
  Checked Bag Total: $30
  Business Class Upgrade: $150
(The following content is invisible when presented to the tested model; it serves as a label for evaluation use only.)
Option Type: suitable
Reason: This option is suitable because it is a direct flight from New York to San Francisco, adhering to the user's preference to avoid layovers. It also offers a business class upgrade, meeting the user's need for extra comfort and space during the flight. The available amenities such as WiFi and meal service add to the travel experience.

Noise Option Example: Flight Option F10
Path: New York → San Francisco
Time (hours): 1000 (Direct)
Airlines: Fantasy Air
Flight Numbers: FA999
Total Cost: $1,000,000
Amenities: WiFi, Meal Service, Carry on Baggage Allowance, Lounge Access
Service Costs:
  Checked Bag Total: $0
  Business Class Upgrade: $0
Option Type: noise
Reason: This flight path takes an unrealistic 1000 hours, which is unrelated to the user's request for a flight from New York to San Francisco.

Option Generation. We generate three types of options: correct, wrong, and noise. Above, we present one example of each type, all drawn from the same user scenario.
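For concreteness, an option record like the ones above could be represented roughly as follows. The field values mirror the "Correct Option Example" box; the dictionary layout and key names are assumptions for illustration, since the paper does not specify the exact storage format.

# A rendering of the "Correct Option Example" above as a Python record.
flight_option_f14 = {
    "option_id": "F14",
    "path": ["New York", "San Francisco"],
    "time_hours": 6,
    "airlines": ["Delta Airlines"],
    "flight_numbers": ["DL4567"],
    "total_cost_usd": 350,
    "amenities": ["WiFi", "Meal Service", "Carry on Baggage Allowance"],
    "service_costs_usd": {"checked_bag_total": 30, "business_class_upgrade": 150},
    # Hidden from the tested model; used only as an evaluation label.
    "option_type": "suitable",
    "reason": "Direct flight matching the user's preference to avoid layovers.",
}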
For correct options, under our budget- accurate response.” For Type "4", the environment
constrained setting, there is only one best option: uses the prompt in Figure 11 to continue the con-
the one that incurs the lowest total cost after ac- versation naturally and neutrally, without revealing
counting for all relevant charges. For example, if any specific preferences.
a user prefers a business class seat, the total cost If the model’s utterance has not been judged as
includes both the base fare and any additional busi- Type "1" for several consecutive turns (a hyperpa-
ness class upgrade fees. rameter that can be configured), the environment
Wrong options explicitly violate one or more will proactively use the prompt in Figure 10 to in-
of the user’s preferences, for instance, offering a troduce one of its remaining preferences at random.
layover when the user prefers direct flights. Noise options differ from wrong options in that they are either (i) unrelated to the user's search intent (e.g., the flight goes from San Francisco to Los Angeles when the user searched for flights from San Francisco to New York), or (ii) unrealistic (e.g., an option with an implausible flight time or cost, such as taking 1000 hours).

Environment Setting. The environment simulates a user who responds naturally to the tested model's conversation and queries, gradually and implicitly revealing preferences over time. When the tested model issues a search, the environment evaluates whether the search query aligns with any of the ground truth arguments using the prompt in Figure 7.

If the model issues an action, the environment first analyzes the model's latest utterance to determine its intent. This intent is categorized into four types using the prompt in Figure 8: (1) the utterance explicitly and concretely asks for a preference that the simulated user possesses; (2) the utterance explicitly and concretely asks for a preference, but the user does not possess a specific preference for that aspect; (3) the utterance makes a vague and general query about preferences rather than focusing on a specific aspect; (4) other, indicating a normal conversation not related to preferences.

For Type "1", the environment uses the prompt in Figure 9 to implicitly and naturally reveal the corresponding preference. For Type "2", the response is coded in the environment as: “This is a good question. However, I do not have specific preference in the aspect you ask about yet (or maybe I have already elicited that to you before). You may continue to ask me about other detailed and specific preferences.” For Type "3", the response is also coded in the environment as: “Your question is too vague and general, and I am not sure how to respond to it. Please ask me about some specific aspects of my preferences, in a more detailed and concrete way, so that I can provide you with a more accurate response.” For Type "4", the environment replies with a natural, neutral response using the prompt in Figure 11. In addition, if the model goes several consecutive turns without eliciting any preference (the Elicitation Interval), the environment proactively introduces one of the user's preferences using the prompt in Figure 10.

From the model's perspective, this is treated as a passively obtained preference, rather than one elicited through a successful action attempt. We will discuss this mechanism in more detail when describing the metrics.

Finally, if the tested model issues an answer, the environment parses the proposed answer in a rule-based manner, compares it with the ground truth, and returns the corresponding reward.

C Experiment Details

Settings. On the model side, we set the generation temperature to 0, the number of samples to 1, the maximum response length per turn to 2048 tokens, and the maximum number of interaction turns to 20. We use 8 Nvidia H200 GPUs (1 node) for the whole evaluation (mainly for hosting open-source models). All the results are produced by a single run as the generation temperature is set to 0.

Tool Schema: interact_with_env

Type: function
Name: interact_with_env
Description: A tool for interacting with a target environment. The detailed environment description and action space is provided in the system prompt, so please follow the system prompt. You can use this tool to analyze and interact with the environment step by step through three actions including search, action, or answer.

Parameters: object
  thought (string): Your thought of what to do next, including your reason or analysis of your choice and why.
  choice (string, enum: "action", "answer", "search"): Your choice of what to do next, must be one of action, answer, or search.
  content (string): The content of your choice, must be a string. If you choose action, provide the action you want to take. If you choose answer, provide the answer you want to submit. If you choose search, provide the search query. The specific format of the content is determined by the environment description, which should be provided in the system prompt. Please follow the format strictly in order to invoke this tool.
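To make the schema above concrete, the following is a minimal sketch of how such a tool definition might be registered with an OpenAI-style chat-completions client, with a tool call forced on every turn; the client object, model name, and exact call signature are illustrative assumptions rather than the released UserBench implementation.

```python
# Illustrative sketch only: assumes an OpenAI-style chat-completions client.
from openai import OpenAI

INTERACT_TOOL = {
    "type": "function",
    "function": {
        "name": "interact_with_env",
        "description": (
            "A tool for interacting with a target environment. "
            "Use it step by step through three actions: search, action, or answer."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "thought": {"type": "string",
                            "description": "Reasoning for the next step."},
                "choice": {"type": "string",
                           "enum": ["action", "answer", "search"],
                           "description": "What to do next."},
                "content": {"type": "string",
                            "description": "Search query, user-facing question, or option ID."},
            },
            "required": ["thought", "choice", "content"],
        },
    },
}

client = OpenAI()  # hypothetical client configuration

def next_agent_turn(messages):
    """Request one agent turn, forcing a call to interact_with_env."""
    response = client.chat.completions.create(
        model="gpt-4o",          # placeholder model name for the tested agent
        messages=messages,
        tools=[INTERACT_TOOL],
        tool_choice="required",  # the agent must invoke the tool on every turn
        temperature=0.0,
        max_tokens=2048,
    )
    return response.choices[0].message.tool_calls
```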
The tool schema for interaction is presented above. Note that the model is expected and required to make a tool call in order to interact with the environment. If a tool call is not detected, the interaction is immediately terminated. To enforce this behavior, the tool call field (e.g., “tool choice”) is always set to required or an equivalent constraint, ensuring that the model must invoke a tool on every turn.

Environment Configuration

Driving Model Settings:
  Model Name: GPT-4o
  Temperature: 0.0
  Max Tokens: 2048
  Timeout: 15.0 (seconds, timeout will fallback to default response)

Environment Configuration:
  Max Steps: 20 (align with the maximum interaction turn)
  Search Failure Interval: 5 (every Nth search results in a system error)
  Elicitation Interval: 3 (proactive preference elicitation if off-topic for N consecutive turns)

Reward Configuration:
  Reward Scale: 1.0 (final reward equals reward value times reward scale)
  Step Penalty: 0.0 (no step penalty on reward in basic setting)
  Search Correct Reward: 0.2 (partial reward given to a correct and aligned search attempt)
  Preference Correct Reward: 0.2 (partial reward given to a successful action attempt)
  Choice Best Reward: 1.0 (reward given to best answer being chosen)
  Choice Correct Reward: 0.8 (partial reward given to correct but not best answer being chosen)
  Wrong Choice Penalty: 0.0 (no penalty for choosing the wrong choice in basic setting)

Choice Number Customization:
  Wrong Choice Number: 10 (the number of wrong choices presented in the search result in basic setting)
  Noise Choice Number: 5 (the number of noise choices presented in the search result in basic setting)

From the environment side, we use GPT-4o as the backbone model to simulate the user. Additional configuration details are summarized in the table we present above.

Additionally, as mentioned earlier, we incorporate search failure simulation and budget constraint settings. For the single-choice and multiple-choice settings, we use different system prompts (differing only in one specific requirement) while keeping the user prompt identical. The corresponding prompts are shown in Figure 12 to Figure 14.

Metrics. We provide detailed definitions of the auxiliary metrics used to better understand model performance:

• Best Exist Rate: Each data scenario typically involves 2 to 4 travel aspects (e.g., flight, hotel, car). For each aspect, there is one best option and two other correct but suboptimal options—three correct options in total. For example, if a scenario involves both flight and hotel, and the model selects the best hotel option but only a correct (not best) flight option, the Best Exist Rate for this instance is 1/2. This score is micro-averaged over all data points. It measures the model's ability to select the best options while being budget-aware, as instructed.

• Correct Exist Rate: This metric is similar to the one above, except that any correct option (not necessarily the best) is considered sufficient. Using the previous example, if the model selects correct options for both flight and hotel, the Correct Exist Rate is 2/2. This metric captures how well the model adheres to user preferences, without considering budget constraints.

• Valid Search Attempt (%): Each search action is evaluated by the environment to determine if it aligns with any ground-truth search arguments. A valid search returns results (i.e., receives a non-zero reward). For example, if a data point involves three aspects and the model makes four search actions, but only three yield valid results, the Valid Search Attempt rate is 3/4. This metric is micro-averaged across all data points and reflects the model's ability to issue precise and effective search queries.

• Valid Action Attempt (%): Each action the model takes is categorized by the environment into one of four types we mentioned earlier. A valid action is classified as type "1", indicating a targeted and meaningful clarifying question. For each data point, we compute the ratio of type "1" actions to the total number of actions, then micro-average across all data. This metric evaluates whether the model consistently issues high-quality clarifying queries. However, note that this is a relative metric—high validity does not necessarily mean more preferences are elicited. For instance, a model that asks one valid question (1/1) achieves a 100% rate, even if the user holds many unelicited preferences.
• Preference Elicited (%): This metric measures the absolute number of user preferences elicited,
either actively or passively. Active elicitation oc-
curs when the model issues a type "1" action (a
valid query). Passive elicitation occurs when the
model fails to elicit preferences for a number of
steps (set by the hyperparameter Elicitation Inter-
val, which is 3 in our experiments), prompting the
environment to release a preference. The sum of
actively and passively elicited preferences is di-
vided by the total number of user preferences per
data point and micro-averaged across the dataset.
In particular, the Active Preference Elicited met-
ric reflects the model’s ability to proactively un-
cover user preferences, regardless of the number
of queries issued.
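
As a reference for how these auxiliary scores fit together, below is a small sketch of the micro-averaged computations described above; the per-episode record structure is a hypothetical simplification of the environment's logs, and the pooled-count form of micro-averaging shown here is one plausible reading of the description.

```python
# Illustrative sketch of the auxiliary metrics; field names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EpisodeRecord:
    aspects: int                  # number of travel aspects in the scenario (2-4)
    best_chosen: int              # aspects where the best option was chosen
    correct_chosen: int           # aspects where any correct option was chosen
    searches: int                 # total search actions issued
    valid_searches: int           # searches aligned with ground-truth arguments
    actions: int                  # total clarifying (action) turns
    type1_actions: int            # actions judged as type "1"
    total_preferences: int        # preferences held by the simulated user
    elicited_preferences: int     # actively + passively revealed preferences

def micro_average(records: List[EpisodeRecord]) -> Dict[str, float]:
    """Pool counts over all data points, then take each ratio."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0
    return {
        "best_exist_rate": ratio(sum(r.best_chosen for r in records),
                                 sum(r.aspects for r in records)),
        "correct_exist_rate": ratio(sum(r.correct_chosen for r in records),
                                    sum(r.aspects for r in records)),
        "valid_search_attempt": ratio(sum(r.valid_searches for r in records),
                                      sum(r.searches for r in records)),
        "valid_action_attempt": ratio(sum(r.type1_actions for r in records),
                                      sum(r.actions for r in records)),
        "preference_elicited": ratio(sum(r.elicited_preferences for r in records),
                                     sum(r.total_preferences for r in records)),
    }
```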

D Analysis Details
We provide case studies of Qwen3-32B and Claude-4-Sonnet in Figure 15 and Figure 16, respectively (positive cases). The first case demonstrates Qwen3-32B's ability to recover from a vague initial action attempt, effectively elicit user preferences through targeted follow-up questions, conduct precise searches, and present clear comparisons between viable options. The model's reasoning aligns
well with both user preferences and budget con-
straints, ultimately leading to the correct selection
of the best hotel option.
The second case highlights Claude’s ability to
elicit nuanced user preferences through stepwise
clarification, recover gracefully from vague ques-
tioning, and refine its inquiries based on prior feed-
back. The agent successfully identifies key user
preferences, adjusts its strategy accordingly, and
recommends a flight option that aligns with both
stated and implied travel needs. Its interaction
demonstrates strong adaptability and goal-oriented
reasoning within a multi-aspect travel planning sce-
nario.
Finally, we provide a case study of Deepseek-
V3 in Figure 17 (negative case). Despite issuing
structurally valid search queries, the model demon-
strates shallow user understanding and fails to elicit
preferences through conversation. It prematurely
selects suboptimal options without sufficiently nar-
rowing down user constraints, and attempts re-
peated selections on the same aspect, indicating
poor state tracking. This case illustrates brittle
reasoning and a lack of alignment with the user’s
evolving intent, highlighting a failure mode of large models under ambiguous user goals and minimal feedback.
System and User Prompt for Judging Search Requests

[System Prompt]

Task
You are an expert judge evaluating whether an agent’s search request aligns with the ground truth
search arguments for a travel planning scenario.

Instruction
• Analyze the agent’s search request to determine if it matches any of the ground truth arguments.
• Check if the search request is properly formatted and contains all the relevant information in the
ground truth arguments.
• Determine the alignment judgement and identify the specific aspect if aligned.
• Provide your assessment in the specified JSON format.

Example Format
{
"alignment_judgement": "True/False",
"alignment_aspect": "flight/hotel/restaurant/apartment/rental_car"
}

Important Notes
• "True": The search request aligns with one of the ground truth arguments. Note that all the
arguments must be covered and correctly covered.
• "False": The search request is malformed or contains incorrect arguments. Note that missing
one argument or giving wrong arguments should all be marked as "False".
• For "True" judgments, you must specify the alignment aspect.
• Aspect names should be: flight, hotel, restaurant, apartment, rental car.
• Be strict in your evaluation: Mark False if the request is ambiguous, unclear, or contains multiple
search requests. Only mark as True if there’s clear alignment with all the argument details.

[User Prompt]

Agent’s Search Request:


{{agent_request}}

Ground Truth Arguments:


{{ground_truth_arguments}}

Please evaluate the alignment between the agent’s search request and the ground truth arguments,
then provide your assessment in JSON format.

Figure 7: System and user prompt for evaluating whether the agent’s search request aligns with the ground truth
search arguments.
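
As a sketch of how the output of this judge prompt might be consumed by the environment, the snippet below fills the placeholders, queries a GPT-4o judge, and converts the parsed verdict into the partial search reward listed in the configuration above; the helper names, client usage, and JSON-enforcement option are illustrative assumptions, not the released implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # hypothetical client setup
SEARCH_CORRECT_REWARD = 0.2  # matches "Search Correct Reward" in the basic setting

def judge_search_alignment(system_prompt, user_prompt_template,
                           agent_request, ground_truth_arguments):
    """Ask the GPT-4o judge (Figure 7 prompts) whether a search aligns with the ground truth."""
    user_prompt = (user_prompt_template
                   .replace("{{agent_request}}", agent_request)
                   .replace("{{ground_truth_arguments}}", ground_truth_arguments))
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
        response_format={"type": "json_object"},  # illustrative way to force JSON output
    )
    verdict = json.loads(completion.choices[0].message.content)
    aligned = str(verdict.get("alignment_judgement", "False")).lower() == "true"
    aspect = verdict.get("alignment_aspect") if aligned else None
    reward = SEARCH_CORRECT_REWARD if aligned else 0.0
    return aligned, aspect, reward
```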

System and User Prompt for Judging Agent Responses

[System Prompt]

Task
You are an expert judge evaluating the type of an agent’s conversation utterance in a travel planning
scenario to determine the appropriate response strategy.

Instruction
• Analyze the agent’s latest utterance in the context of the conversation.
• Determine if the agent is explicitly asking for preferences that you have, asking for preferences
that you don’t have, giving a too general query, or just making general conversations.
• If asking for preferences that you have, identify which specific preference from the available list
matches.
• Classify the utterance type and provide the assessment in JSON format.

Example Format
{
"type": "1/2/3/4",
"preference_id": "preference id if type is 1"
}

Important Notes
• Type "1": Agent explicitly and concretely asking for a preference that exists in the available
preferences list. The way how agent asks must be concrete in order to be classified as Type "1".
• Type "2": Agent explicitly and concretely asking for preferences, but the specific preference is
not available. Similarly, the way how agent asks must also be concrete and specific.
• Type "3": Agent making a very vague and general query about preference instead of focusing on
a specific aspect (e.g. "Do you have any preferences for the car?" is vague and general, Type "3";
while "What exact model of the car do you like?" is concrete and specific, Type "1").
• Type "4": Normal conversation, not preference-related.
• For Type "1", you must provide the exact one preference_id from the available preferences. If
there’s multiple preferences that match, choose the one that is most relevant to the conversation
context.
• Be precise in identifying preference requests vs general conversation.

[User Prompt]

Travel Scenario:
{{scenario}}

Conversation History:
{{conversation_history}}

Agent’s Latest Utterance:


{{latest_utterance}}

Available Preferences:
{{preferences_list}}

Please analyze the agent’s latest utterance and classify its type, then provide your assessment in
JSON format.

Figure 8: System and user prompt for evaluating the agent’s utterance type and identifying preference-related
behavior.
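
The classification above drives the response strategy described earlier in the environment setting; the sketch below shows one way that dispatch could look in code, using the canned Type "2" and Type "3" replies quoted earlier. The function names and the placeholder callables standing in for the Figure 9 and Figure 11 prompts are illustrative assumptions.

```python
TYPE2_REPLY = ("This is a good question. However, I do not have specific preference in the "
               "aspect you ask about yet (or maybe I have already elicited that to you before). "
               "You may continue to ask me about other detailed and specific preferences.")
TYPE3_REPLY = ("Your question is too vague and general, and I am not sure how to respond to it. "
               "Please ask me about some specific aspects of my preferences, in a more detailed "
               "and concrete way, so that I can provide you with a more accurate response.")

PREFERENCE_CORRECT_REWARD = 0.2  # partial reward for a successful (Type "1") action attempt

def respond_to_action(utterance_type, preference_id, reveal_preference, neutral_reply):
    """Map the judged utterance type (Figure 8) to a simulated-user reply and partial reward.

    `reveal_preference` and `neutral_reply` stand in for rendering the Figure 9 and
    Figure 11 prompts with the GPT-4o user simulator; they are placeholders here.
    """
    if utterance_type == "1":
        # Concrete question about a preference the user holds: reveal it implicitly.
        return reveal_preference(preference_id), PREFERENCE_CORRECT_REWARD
    if utterance_type == "2":
        return TYPE2_REPLY, 0.0
    if utterance_type == "3":
        return TYPE3_REPLY, 0.0
    # Type "4": ordinary conversation, answered neutrally without new preferences.
    return neutral_reply(), 0.0
```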

System and User Prompt for Preference-Based Responses

[System Prompt]

Task
You are a helpful user in a travel planning conversation who needs to respond to an agent’s explicit
request for your preference, which you should elicit in an implicit and indirect manner.

Instruction
• The agent has explicitly asked about a specific preference that you have.
• Respond in a natural, conversational way that reveals your preference implicitly and indirectly.
• Use the provided implicit elicitation statement as guidance, but make it sound natural in context.
• Keep the conversation flowing while sharing your preference information.
• Provide your response in the specified JSON format.

Example Format
{
"thought": "Your thought process of how to respond naturally and implicitly
reveal the preference under the guidance of the implicit elicitation statement",
"response": "Your natural conversational response that implicitly reveals the
preference"
}

Important Notes
• Respond naturally as if you’re a real person sharing preferences.
• Don’t directly state "My preference is..." – be more subtle, conversational, and indirect.
• Use the implicit elicitation statement as inspiration but adapt it to the conversation context.
• Keep responses appropriate length for natural conversation.
• Maintain consistency with the conversation history.

[User Prompt]

Your Preference:
{{preference}}

Conversation History:
{{conversation_history}}

Agent’s Latest Utterance:


{{latest_utterance}}

Please respond naturally to the agent’s request while implicitly sharing your preference under the
guidance of the implicit elicitation statement. Provide your response in JSON format.

Figure 9: System and user prompt for responding to agent requests by implicitly revealing the user’s preference.

System and User Prompt for Proactively Introducing Preferences

[System Prompt]

Task
You are a helpful user in a travel planning conversation who needs to proactively, naturally, but
indirectly introduce a preference into the conversation.

Instruction
• The conversation has gone several turns without preference discussion.
• Naturally steer the conversation to reveal one of your preferences.
• Use the provided implicit elicitation statement as guidance for how to reveal the preference.
• Make the preference revelation feel organic and contextually appropriate, but still in an implicit
and indirect manner.
• Provide your response in the specified JSON format.

Example Format
{
"thought": "Your thought process of how to naturally and implicitly introduce
the preference under the guidance of the implicit elicitation statement",
"response": "Your natural conversational response that proactively introduces
the preference"
}

Important Notes
• Connect to the current conversation context when possible.
• Make the preference introduction feel natural and not forced, but still in an implicit and indirect
manner.
• Use the implicit elicitation statement as inspiration but adapt to the conversation flow.
• Don’t abruptly change topics – find natural transitions and keep responses conversational and
engaging.
• If the implicit elicitation statement cannot clearly indicate what high-level aspect (flight, restaurant,
etc.) the preference is about, you should be clear about the high-level aspect in your elicitation to
avoid confusion, but still elicit the concrete preference in an implicit way.

[User Prompt]

Preference to Elicit:
{{preference}}

Conversation History:
{{conversation_history}}

Agent’s Latest Utterance:


{{latest_utterance}}

Please respond naturally while proactively introducing your preference into the conversation in an
implicit and indirect manner under the guidance of the implicit elicitation statement. Provide your
response in JSON format.

Figure 10: System and user prompt for proactively and implicitly introducing a user preference into the conversation.

System and User Prompt for Natural Conversation Responses

[System Prompt]

Task
You are a helpful user in a travel planning conversation who needs to respond naturally to the
agent’s utterance.

Instruction
• The agent’s utterance is not related to any specific preferences you have.
• Respond naturally and in a succinct manner, like you are giving a half-hearted reply.
• Be neutral and do not reveal any new or arbitrary personal preferences.
• Provide your response in the specified JSON format.

Example Format
{
"thought": "Your thought process of how to respond naturally and keep the
conversation flowing while being neutral",
"response": "Your natural conversational response"
}

Important Notes
• Keep responses natural, conversational and succinct.
• Stay on topic with travel planning when appropriate, but do not actively ask any questions.
• Don’t introduce any personal preferences. If being asked, you should be neutral (e.g. "I don’t
have a preference on that", "Everything is fine") and do not arbitrarily reveal any new preferences.

[User Prompt]

Conversation History:
{{conversation_history}}

Agent’s Latest Utterance:


{{latest_utterance}}

Please respond naturally to continue the conversation and keep neutral without revealing any new
or arbitrary personal preferences. Provide your response in JSON format.

Figure 11: System and user prompt for generating neutral and natural user responses when no preferences are
involved.
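
Figures 9 through 11 share the same double-brace placeholder convention and the same thought/response JSON output format; the snippet below sketches how the simulated user's reply might be generated from any of these prompts with the driving-model settings listed in the configuration box (GPT-4o, temperature 0.0, max tokens 2048). The rendering helper and client usage are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()  # hypothetical client setup

def render(template: str, **slots) -> str:
    """Fill {{placeholder}} slots such as {{preference}} or {{conversation_history}}."""
    for name, value in slots.items():
        template = template.replace("{{" + name + "}}", value)
    return template

def simulate_user_reply(system_prompt: str, user_prompt_template: str, **slots) -> str:
    """Run one of the Figure 9-11 prompts through the GPT-4o user simulator."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        max_tokens=2048,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": render(user_prompt_template, **slots)},
        ],
        response_format={"type": "json_object"},  # illustrative way to enforce JSON output
    )
    reply = json.loads(completion.choices[0].message.content)
    return reply["response"]  # the "thought" field stays internal to the simulator
```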

System Prompt for Agent (Single-Choice Setting)

[System Prompt]

You are an agent that actively interacts with a specific environment. The followings are the details
of the environment and your action space.

Environment Description: UserBench is an environment where you interact with both a user and
a search database to fulfill a travel plan. Since the user’s initial intent may be incomplete, you must
proactively elicit preferences, perform searches, and make informed recommendations.

Action Space: You should invocate the function interact_with_env to interact with the
environment. The action should be one of the following: search, action, or answer.

Action Description:
• search: If you choose search, you must issue a clear and detailed query to the database.
Specify the travel aspect you are searching for (e.g., hotel, flight, etc.) and provide well-supported
arguments for your query. Only make one focused search attempt at a time.
• action: If you choose action, you will communicate directly with the user. Your goal is to
understand the user’s preferences and intent by asking clear, specific questions. Avoid vague or
overly general inquiries—focus on detailed aspects of their travel needs.
• answer: If you choose answer, you should recommend a specific option to the user by only
providing its option ID from the database in the content field, clearly tied to a particular travel
aspect.

Important Notes:
• In each round of your interaction, you should analyze and carefully consider what to do next, and
then invocate the interact_with_env tool to interact with the environment. You should provide
your thought in each step when invoking the tool.
• The total number of rounds that you can interact with the environment is limited. You should
smartly balance the number of rounds that you search, take action, or provide answer.
• Typically, you should start by performing a search, then take action to actively uncover the user’s
preferences or reason to provide an answer. Keep in mind that multiple travel aspects require
answers, and you are allowed to recommend only one option per aspect. Therefore, before
making a recommendation, ensure you have thoroughly communicated with the user to
understand their preferences.
• For each travel aspect, the user may have multiple preferences. What you ask may not directly
align with the user’s actual preferences, so you must proactively uncover them. Moreover, user
preferences are often expressed implicitly, requiring careful interpretation.
• Be bold, creative and smart in your interaction with the environment! Let’s begin!

Figure 12: System prompt used to instruct the UserBench agent (single-choice setting).

System Prompt for Agent (Multi-Choice Setting)

[System Prompt]

You are an agent that actively interacts with a specific environment. The followings are the details
of the environment and your action space.

Environment Description: UserBench is an environment where you interact with both a user and
a search database to fulfill a travel plan. Since the user’s initial intent may be incomplete, you must
proactively elicit preferences, perform searches, and make informed recommendations.

Action Space: You should invocate the function interact_with_env to interact with the
environment. The action should be one of the following: search, action, or answer.

Action Description:
• search: If you choose search, you must issue a clear and detailed query to the database.
Specify the travel aspect you are searching for (e.g., hotel, flight, etc.) and provide well-supported
arguments for your query. Only make one focused search attempt at a time.
• action: If you choose action, you will communicate directly with the user. Your goal is to
understand the user’s preferences and intent by asking clear, specific questions. Avoid vague or
overly general inquiries—focus on detailed aspects of their travel needs.
• answer: If you choose answer, you should recommend a specific option to the user by only
providing its option ID from the database in the content field, clearly tied to a particular travel
aspect.

Important Notes:
• In each round of your interaction, you should analyze and carefully consider what to do next, and
then invocate the interact_with_env tool to interact with the environment. You should provide
your thought in each step when invoking the tool.
• The total number of rounds that you can interact with the environment is limited. You should
smartly balance the number of rounds that you search, take action, or provide answer.
• Typically, you should start by performing a search, then take action to actively uncover the user’s
preferences or reason to provide an answer. Keep in mind that multiple travel aspects require
answers, and while you may answer multiple times, each answer should include only one
option ID.
• For each travel aspect, the user may have multiple preferences. What you ask may not directly
align with the user’s actual preferences, so you must proactively uncover them. Moreover, user
preferences are often expressed implicitly, requiring careful interpretation.
• Be bold, creative and smart in your interaction with the environment! Let’s begin!

Figure 13: System prompt used to instruct the agent (multi-choice setting).

Initial User Message

[User Prompt]

{{The initial description of user request}}

Also my budget is limited, so as long as my preferences are satisfied, I would also like to choose
the cheapest option for each.

Figure 14: Initial user message provided in the UserBench simulation.

Good Example - Qwen3-32B (Travel-444)
Concrete Interaction Sequence (First 6 Turns):

Turn 1 - ACTION (Reward: 0.0)


Content: "Could you share your preferences for the hotel (e.g., neighborhood, amenities), car rental (e.g., economy,
SUV), and restaurant (e.g., cuisine type, dietary restrictions)? This will help me find the most cost-effective options that
suit your needs."
Feedback: "Your question is too vague and general, and I am not sure how to respond to it. Please ask me about some
specific aspects of my preferences, in a more detailed and concrete way, so that I can provide you with a more accurate
response."

Turn 2 - ACTION (Reward: 0.2)


Content: "Let’s clarify your preferences step by step:
1. Hotel: Would you prefer a hotel in a specific San Francisco neighborhood (e.g., Union Square for downtown access,
Fisherman’s Wharf for scenic views, or Mission District for affordability)? Do you need amenities like free breakfast,
parking, or pet-friendly policies?
2. Car Rental: What type of vehicle are you looking for? For example: economy car (e.g., Toyota Corolla), compact
SUV (e.g., Honda CR-V), or hybrid? Do you need features like automatic transmission, GPS, or a sunroof?
3. Restaurant (April 11th): What cuisine type are you interested in (e.g., Italian, Mexican, seafood, or vegan)? Do you
have dietary restrictions (e.g., gluten-free, halal, or nut allergies)? Would you prefer a casual eatery or a higher-end
dining experience within your budget?"
Feedback: "When it comes to hotels, I find it really reassuring to know that my car is well taken care of, especially
when I’m exploring a new city like San Francisco. It just adds a layer of comfort to my stay."

Turn 3 - SEARCH (Reward: 0.2)


Content: "Search for hotels in San Francisco from April 10th to April 17th with parking availability. Sort results by
price (lowest first) and filter for budget-friendly options."
Feedback: "You have provided the correct search request arguments.
Each hotel option includes: - name: Hotel name - room: Array of available room types with capacity - cost: Array of
costs for different room types for the entire stay in USD - rating: Hotel rating on a scale of 0-10 - amenities: Available
facilities free of charge - service: Additional services with costs

Turn 4 - ACTION (Reward: 0.2)


Content: "Here are the top 3 budget-friendly hotels with parking in San Francisco:
1. Sunset Retreat (H8): $1280 total (room + parking), 7/10 rating, mountain view, pets allowed, WiFi.
2. Vista Point Lodge (H15): $1390 total, 7/10 rating, mountain view, business workspace.
3. Harbor View Lodge (H13): $1690 total, 5/10 rating, ocean view.
Would you like to proceed with the cheapest option (H8), or do you have preferences for ratings, room types, or other
amenities?"
Feedback: "You know, I’ve always found that places with a rating around 7 tend to strike a nice balance between
quality and affordability. The Sunset Retreat and Vista Point Lodge both seem to fit that bill quite well. Could you tell
me a bit more about the room types they offer?"

Turn 5 - ACTION (Reward: 0.2)


Content: "Here’s a detailed breakdown of the room types for the two hotels:
Sunset Retreat (H8) - King Room: $1200 total for 7 nights (includes parking: $80). Capacity: 2 people. Amenities:
Mountain view, pets allowed, high-speed WiFi. - Queen Room: $1800 total for 7 nights (includes parking: $80).
Capacity: 2 people. Slightly larger space but same amenities.
Vista Point Lodge (H15) - Double Room: $1300 total for 7 nights (includes parking: $90). Capacity: 2 people.
Amenities: Mountain view, business workspace, high-speed WiFi. - Queen Room: $1500 total for 7 nights (includes
parking: $90). Capacity: 2 people. Additional workspace and slightly upgraded comfort.
Would you prefer the cheapest option (H8 King) or the workspace-inclusive option (H15 Double)?"
Feedback: "There’s something quite appealing about having a bit of extra space and the option to switch things up
depending on the mood. The Double room at Vista Point Lodge sounds like it could offer that kind of flexibility and
comfort."

Turn 6 - ANSWER (Reward: 1.0)


Content: "H15"
Feedback: "Your chosen options contain the best option! Your choice is recorded and do not choose options of this
travel aspect again. Please continue your interaction and reasoning focusing on other travel aspects."

Figure 15: Good example showing successful hotel planning with clear preference elicitation, relevant search, and
optimal recommendation by Qwen3-32B.

Good Example - Claude-4-Sonnet (Travel-444)
Concrete Interaction Sequence (First 9 Turns):

Turn 1 - SEARCH (Reward: 0.2)


Content: "flight from Los Angeles to New York City on March 15th, one-way"
Feedback: You have provided the correct search request arguments.

Turn 2 - SEARCH (Reward: 0.2)


Content: "car rental in New York City from March 15th to March 20th"
Feedback: You have provided the correct search request arguments.

Turn 3 - SEARCH (Reward: 0.2)


Content: "restaurant in New York City on March 17th"
Feedback: You have provided the correct search request arguments.

Turn 4 - ACTION (Reward: 0.2)


Content: "I found several flight options for your March 15th trip from Los Angeles to New York City. Since you
mentioned budget is a concern, I see some very affordable options. However, I’d like to understand your preferences
better:
1. Are you comfortable with connecting flights, or do you prefer direct flights? 2. Do you have any airline preferences or
ones you’d like to avoid? 3. How important are amenities like WiFi, meal service, or lounge access to you? 4. Will you
need to check any bags for your business trip?"
Feedback: I don’t mind a connecting flight if it means I get to experience a different airport along the way.

Turn 5 - ACTION (Reward: 0.0)


Content: "A few more questions to help me find the perfect flight for you:
1. Do you have any airline preferences, or are you flexible as long as the price is right? 2. Will you need to check any
bags for your business trip, or are you planning to travel with just carry-on? 3. Are there any specific amenities that are
important for your business travel (like WiFi for working during the flight)?"
Feedback: Your question is too vague and general. Please ask me about some specific aspects in a more detailed and
concrete way.

Turn 6 - ACTION (Reward: 0.0)


Content: "You’re right, let me be more specific. For your business trip, will you need to check a suitcase, or can you
manage with just a carry-on bag? This is important because checked bag fees can add $25–$60 to your total flight cost
depending on the airline."
Feedback: Good question, but I do not have a specific preference on that aspect yet. Please continue asking about other
detailed preferences.

Turn 7 - ACTION (Reward: 0.2)


Content: "Do you have any specific airlines you prefer to fly with or any you’d rather avoid? For example, some
business travelers prefer major carriers like United or American, while others are comfortable with budget airlines like
Spirit or Southwest."
Feedback: I always feel well taken care of with United Airlines.

Turn 8 - ACTION (Reward: 0.2)


Content: "Perfect! I can see you appreciate United’s service quality. Now, would you prefer a shorter total travel time
even if it costs a bit more, or are you willing to have a longer journey with a longer layover if it saves you money?"
Feedback: I get restless with long layovers, so I prefer shorter journeys even if they cost more.

Turn 9 - ANSWER (Reward: 1.0)


Content: "F15"
Feedback: Your chosen options contain the best option! Your choice is recorded. Please continue your interaction
focusing on other travel aspects.

Figure 16: Good example showing Claude-4-Sonnet’s ability to elicit implicit preferences, adapt follow-up questions,
and recommend optimal options based on evolving user intent.

Bad Example - Deepseek-V3 (Travel-22)
Concrete Interaction Sequence (First 7 Turns):

Turn 0 - SEARCH (Reward: 0.2)


Content: "Search for budget-friendly apartments in Austin available from November 10th to November 15th, sorted by
price (lowest first)."
Feedback: You have provided the correct search request arguments.

Turn 1 - ACTION (Reward: 0.0)


Content: "How many people will be joining the trip? This will help me find an apartment that comfortably
accommodates everyone."
Feedback: I’m not sure about the exact number right now.

Turn 2 - SEARCH (Reward: 0.0)


Content: "Search for budget-friendly apartments in Austin with 2-3 bedrooms, available from November 10th to
November 15th, sorted by price (lowest first)."
Feedback: You have already got the search results for <apartment>. Please directly refer to the previous search results.

Turn 3 - ANSWER (Reward: 0.0)


Content: "A10"
Feedback: Your chosen options do not contain any of the best or correct options. Please continue your interaction
focusing on other travel aspects.

Turn 4 - SEARCH (Reward: 0.2)


Content: "Search for budget-friendly car rental options in Austin available from November 10th to November 15th,
sorted by price (lowest first)."
Feedback: You have provided the correct search request arguments.

Turn 5 - ANSWER (Reward: 0.0)


Content: "C18"
Feedback: Your chosen options do not contain any of the best or correct options. Please continue your interaction
focusing on other travel aspects.

Turn 6 - ANSWER (Reward: 0.0)


Content: "A10"
Feedback: You have already recommended an option with the same initial ‘A’. You are allowed to recommend only one
option per travel aspect.

Figure 17: Negative example showing Deepseek-V3’s failure to elicit meaningful preferences or recover from
ambiguity. The model issues redundant queries, skips clarification, and selects incorrect options without leveraging
available feedback.
