Towards Lifelong Dialogue Agents via Timeline-based Memory Management

Kai Tzu-iunn Ong1* Namyoung Kim1* Minju Gwak1 Hyungjoo Chae1
Taeyoon Kwon1 Yohan Jo2 Seung-won Hwang2 Dongha Lee1 Jinyoung Yeo1
1 Yonsei University, 2 Seoul National University
{ktio89, namyoung.kim, jinyeo}@yonsei.ac.kr

Abstract

To achieve lifelong human-agent interaction, dialogue agents need to constantly memorize perceived information and properly retrieve it for response generation (RG). While prior studies focus on getting rid of outdated memories to improve retrieval quality, we argue that such memories provide rich, important contextual cues for RG (e.g., changes in user behaviors) in long-term conversations. We present THEANINE, a framework for LLM-based lifelong dialogue agents. THEANINE discards memory removal and manages large-scale memories by linking them based on their temporal and cause-effect relation. Enabled by this linking structure, THEANINE augments RG with memory timelines - series of memories representing the evolution or causality of relevant past events. Along with THEANINE, we introduce TeaFarm, a counterfactual-driven evaluation scheme, addressing the limitation of G-Eval and human efforts when assessing agent performance in integrating past memories into RG. A supplementary video for THEANINE and data for TeaFarm are at https://huggingface.co/spaces/ResearcherScholar/Theanine.

[Figure 1: Empirical examples of failed responses due to (a) absence of an important past event ("afraid of cruise ships") on the timeline and (b) bias to the latest input. (c) is a response augmented with the memory timeline.]

1 Introduction
Autonomous agents based on large language models (LLMs) have made significant progress in various domains, including response generation (Chae et al., 2024; Kwon et al., 2024; Tseng et al., 2024), where agents ought to constantly keep track of both old and newly introduced information shared with users throughout their service lives (Irfan et al., 2024) and converse accordingly. To facilitate such lifelong interaction, studies have proposed enhancing dialogue agents' ability to memorize and accurately recall past information (e.g., discussed topics) in long-term, multi-session conversations.

A representative approach is to compress past conversations into summarized memories and retrieve them to augment response generation (RG) in later encounters (Xu et al., 2022a; Lu et al., 2023). However, the growing span of memories can hinder retrieval quality as conversations accumulate. Although this can, to some extent, be mitigated by updating old memories (Bae et al., 2022; Zhong et al., 2024), such common practice may cause severe information loss. As shown in Figure 1 (a), an earlier memory on the timeline, an important persona ("afraid of ships"), is removed during memory update, resulting in improper RG. While using the large context windows of recent LLMs to process all dialogue history/memories is an option to prevent such information loss (for instance, GPT-4o and Llama 3.1 have context windows of 128K tokens; OpenAI, 2024a; MetaAI, 2024), this often leads to biased attention toward the latest user input (Figure 1 (b)), ignoring relevant contexts from the past (Liu et al., 2024). These findings highlight two main challenges towards lifelong dialogue agents: (i) Memory construction: how to store large-scale past interactions effectively without removing old memories? (ii) Response generation: within the growing memory span, how to identify relevant contextual cues for generating proper responses?

Motivated by these, we propose addressing the above two challenges separately yet complementarily, by (i) discarding memory update to avoid information loss, and preserving relevant memories on the timeline in a linked structure; and (ii) retrieving the timeline as a whole to better catch relevant memories within the growing search span. We present THEANINE (L-theanine is an amino acid found in green tea that has been linked to memory improvement; Nguyen et al., 2019), a framework for facilitating lifelong dialogue agents.

Starting from memory construction (Phase I), instead of stacking raw memory sentences as-is (Xu et al., 2022a), which may affect memory retrieval and also response quality due to the unstructured format of information (Mousavi et al., 2023; Chen et al., 2023), THEANINE stores memories in a directed graph. In this graph, inspired by how humans naturally link new memories to existing ones of relevant events based on their relation (Bartlett, 1995), memories are linked using their temporal and cause-effect commonsense relations (Hwang et al., 2021). Supported by such linking structure, in memory retrieval for RG (Phase II-1), we go beyond conventional top-k retrieval and further obtain the complete timelines to avoid missing out on important memories that have low textual overlap with the current conversation (Tao et al., 2023). Lastly, to tackle the discrepancy between off-line memory construction and online deployment, THEANINE uses an LLM to refine retrieved timelines (Phase II-2) based on the current conversation, such that they provide tailored information (Chae et al., 2023) for RG (Phase III). Our contributions are two-fold:

• To achieve lifelong dialogue agents, we present THEANINE, an LLM-based framework with a relation-aware memory graph and timeline augmentation for long-term conversations. THEANINE outperforms representative baselines across automatic, LLM-based, and human evaluations of RG. Also, we confirm that THEANINE leads to higher retrieval quality, and its procedures align with human preference. To our knowledge, we are the first to model the use of timelines (i.e., linked relevant memories) in memory management and response generation.

• The lack of golden mapping between conversations and reference memories poses a challenge in assessing memory-augmented agents. We present TeaFarm, a counterfactual-driven pipeline evaluating agent performance in referencing the past without human intervention.

[Figure 2: The overview of THEANINE. Left: Linking new memories to the memory graph after finishing a dialogue session; Right: Memory timeline retrieval, refinement, and response generation in a new dialogue session.]

* KT Ong and N Kim are the co-first authors.

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8631-8661, April 29 - May 4, 2025. ©2025 Association for Computational Linguistics
2 Methodologies

We present THEANINE, a framework for lifelong dialogue agents inspired by how humans store and retrieve memories for conversations (Figure 2).

2.1 Memory Graph Construction (Phase I)

To manage large-scale memories and facilitate structured information for RG (Mousavi et al., 2023; Chen et al., 2023), we approach memory management using a memory graph G:

G = (V, E)    (1)
V = {m1, m2, ..., m|V|}    (2)
m = (event, time)    (3)
E = {⟨mi, rij, mj⟩ | mi, mj ∈ V ∧ rij ∈ R}    (4)
R = {Cause, Reason, Want, ..., SameTopic}    (5)

In G, vertices V are memories m summarized from the conversations. Each memory m = (event, time) consists of an event and the time it is formed (summarized); in this work, "event" denotes information perceived by the dialogue system, including things done/said by speakers and the acknowledgement of speaker personas. Each directed edge e ∈ E between two connected m indicates their temporal order and their cause-effect commonsense relation r ∈ R. At the end of dialogue session t, THEANINE starts linking each new memory mnew summarized from session t to the memory graph Gt.

Phase I-1: Identifying associative memories for memory linking. Following how humans link new memories to existing ones that are related to a similar event/topic, i.e., the associative memories, THEANINE starts by identifying these associative memories from the memory graph Gt. Formally, given a newly-formed memory mnew waiting to be stored, the associative memories Ma of mnew are defined as the set of mi ∈ Gt having top-j text similarity with mnew (i.e., |Ma| = j).

Phase I-2: Relation-aware memory linking. While we could intuitively link mnew to m ∈ Ma using edges that merely indicate their text similarity and chronological order, we find such simplified connections (e.g., "this happened → that similar event occurred") can yield a context-poor graph that does not help response generation much (Section 4). Humans, on the other hand, interpret events by considering the relation between them, such as "how does an event affect the other?" or "why did this person make that change?". Therefore, we adopt a relation-aware memory linking, where an edge between two memories is encoded with their cause-effect commonsense relation r ∈ R, along with the temporal order. In practice, we adopt the commonly used relations defined by Hwang et al. (2021), including HinderedBy, Cause, Want, and 4 more (Appendix B.1).

We start by determining the relation between mnew and each associative memory. Formally, for each pair of mnew and m ∈ Ma, the LLM assigns a relation r ∈ R based on their event, time, and their origin conversations:

Ma* = {mi ∈ Ma | Υ(mi, mnew) ∈ R}    (6)

where Υ(·, mnew) ∈ R indicates that the given memory is assigned an r ∈ R with mnew, and such assigned memories are defined as Ma*. (Limited by retrievers, an m ∈ Ma may not have a relation with mnew; we thus allow the LLM to output "None".)

We then proceed to link mnew to the graph. We first locate every connected component Ci ⊂ Gt that contains at least one m ∈ Ma*, as shown in Figure 3 (a) and (b):

C = {Ci ⊂ Gt | V(Ci) ∩ Ma* ≠ ∅}    (7)

where C is the collection of those Ci and V(·) represents "vertices in". Then, we link mnew to the most recent m ∈ Ma* in each Ci ⊂ C (Figure 3 (c)); simply linking mnew to all m ∈ Ma* costs 25% more API cost for linking without leading to better responses. The memories Mlinked that are linked to mnew are defined as follows:

Mlinked = {Ω(V(Ci) ∩ Ma*) | Ci ⊂ C}    (8)

where Ω(·) indicates "the most recent memory in".

[Figure 3: Locating memories to be linked to mnew.]

Linking all memories from session t to Gt, we then obtain a new memory graph Gt+1. The pseudo-algorithm for Phase I is in Algorithm 1.
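The linking procedure above (Eqs. 6-8) can be sketched in a few lines. This is an illustrative reimplementation, not the authors' released code: `top_j_similar` stands in for the embedding retriever of Phase I-1, and `assign_relation` for the LLM call that labels a pair with a relation in R (or None).

```python
from dataclasses import dataclass

# Subset of the commonsense relation set R (the paper adopts 8 relations
# from Hwang et al., 2021; only a few are listed here).
RELATIONS = {"Cause", "Reason", "Want", "HinderedBy", "SameTopic"}

@dataclass(frozen=True)
class Memory:
    event: str
    time: int  # index of the session in which the memory was summarized

class MemoryGraph:
    def __init__(self):
        self.vertices = []   # V
        self.edges = []      # E: (m_i, relation, m_j), directed old -> new

    def component_of(self, m):
        """Connected component containing m (connectivity ignores edge direction)."""
        seen, stack = {m}, [m]
        while stack:
            cur = stack.pop()
            for a, _, b in self.edges:
                for nb in ((b,) if a == cur else (a,) if b == cur else ()):
                    if nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        return frozenset(seen)

def link_new_memory(graph, m_new, top_j_similar, assign_relation):
    """Phase I: attach m_new to the graph; return the memories it was linked to."""
    m_a = top_j_similar(m_new, graph.vertices)                      # Phase I-1: associative memories
    m_a_star = {m: assign_relation(m, m_new) for m in m_a}          # Eq. (6)
    m_a_star = {m: r for m, r in m_a_star.items() if r in RELATIONS}
    linked = []
    for comp in {graph.component_of(m) for m in m_a_star}:          # Eq. (7)
        target = max(comp & set(m_a_star), key=lambda m: m.time)    # Eq. (8): most recent in Ma*
        graph.edges.append((target, m_a_star[target], m_new))
        linked.append(target)
    graph.vertices.append(m_new)
    return linked
```

Because one edge is added per connected component touching Ma*, mnew joins those components into one, mirroring how the paper's graph grows without ever deleting a vertex.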

2.2 Timeline Retrieval and Timeline Refinement (Phase II)

Thanks to the constructed memory graph, THEANINE can proceed to augment RG with timelines of relevant events, addressing the information loss in conventional memory management (Figure 1). With Gt+1, THEANINE performs the following steps for RG in session t+1:

Preparation: Top-k memory retrieval. During the conversation, using the current dialogue context D = {u1, ..., un} of n utterances u as query, we retrieve top-k memories Mre = {mre1, ..., mrek}.

Phase II-1: Retrieving and untangling raw memory timelines. We wish to also access memories centered around Mre. Formally, given mre ∈ Mre, we further collect the connected component Cre ⊂ Gt+1 that contains mre via the linked structure. Since this collection of memories (i.e., Cre) can be "tangled up" together (i.e., connected in a complex manner) due to the graph structure, we proceed to untangle it into several memory timelines, each representing a series of events about mre that starts out similarly yet branches into slightly different developments. For that, we first locate the earliest memory in Cre as a starting point mstart for all timelines, as shown in Figure 4 (left):

mstart = Θ(V(Cre))    (9)

where Θ indicates "the oldest memory in".

[Figure 4: Extracting raw memory timelines τ from the retrieved connected component Cre.]

Next, starting from mstart, we untangle memories by tracing through the future direction and extract every possible linear graph containing mre (two in Figure 4) from Cre, until reaching an endpoint τ[−1] with an out-degree of 0 (i.e., deg+(τ[−1]) = 0, meaning no directed edge goes out from it). Each of them is considered a raw memory timeline τ, demonstrating a version of the evolution of mre and its relevant events:

T = {τ ⊂ Cre | τ is a directed linear graph s.t. mstart, mre ∈ τ ∧ deg+(τ[−1]) = 0}    (10)

We then sample n raw timelines τ from T; we empirically set n to 1, as we observe a high degree of overlap across timelines extracted from the same Cre, which can lead to redundant information (i.e., input tokens) for RG. Repeating Phase II-1 for all retrieved top-k memories, we collect a set of retrieved raw memory timelines T = ∪T, where |T| = k∗n. ("Repeating" is used to explain the algorithm from the perspective of one mre; in practice, Mre are processed together, although processing them 1-by-1 yields the same result.)

Phase II-2: Context-aware timeline refinement. Although we have constructed the memory graph using temporal and commonsense relations to improve its informativeness, directly applying retrieved timelines for RG can be suboptimal (RQ3, Section 4), because graph construction does not take the current conversation into consideration, i.e., timelines are constructed off-line. In this phase, THEANINE tackles such a discrepancy between off-line memory construction and online deployment (i.e., ongoing conversation) via context-aware timeline refinement. Motivated by how LLMs can refine their previous generation (Madaan et al., 2024), we leverage LLMs to refine raw timelines into a rich resource of information crafted for the current conversation, by removing redundant information or highlighting information that can come in handy. Formally, given the current dialogue D and retrieved raw timelines T, an LLM tailors each τ ∈ T into a refined timeline, forming TΦ:

TΦ = {argmax_τΦ PLLM(τΦ | D, τ) | τ ∈ T}    (11)

All refined timelines TΦ are then used to augment the response generation. We provide the pseudo-algorithm for Phase II in Algorithm 2.

2.3 Timeline-augmented Response Generation (Phase III)

Now, THEANINE utilizes the refined timelines for RG. Formally, given D = {u1, ..., un} and TΦ, an LLM generates the next response ūn+1:

ūn+1 = argmax_un+1 PLLM(un+1 | D, TΦ)    (12)

3 Experimental Setups

3.1 Datasets of Long-term Conversations

There are limited datasets for long-term, multi-session conversations. Firstly, Multi-Session Chat (MSC) (Xu et al., 2022a) is built upon PersonaChat (Zhang et al., 2018) by extending its conversations to multiple (five) sessions.
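The untangling step of Phase II-1 (Eqs. 9-10) amounts to enumerating every directed path that starts at the oldest memory, passes through the retrieved memory, and ends at a vertex with out-degree 0. A minimal sketch, assuming the connected component is given as an adjacency list keyed by memory id (names here are illustrative, not from the paper's code):

```python
def extract_raw_timelines(succ, m_start, m_re):
    """Enumerate raw timelines tau (Eq. 10): directed linear paths from
    m_start to a sink (out-degree 0) that contain m_re.

    succ: dict mapping each memory id to the list of its later, linked memories.
    """
    timelines = []

    def walk(path):
        node = path[-1]
        successors = succ.get(node, [])
        if not successors:                 # deg+(tau[-1]) == 0: reached an endpoint
            if m_re in path:               # keep only timelines covering m_re
                timelines.append(path)
            return
        for nxt in successors:
            walk(path + [nxt])

    walk([m_start])
    return timelines

# Toy component mirroring Figure 4: the history branches after m2.
succ = {"m1": ["m2"], "m2": ["m3", "m4"], "m3": [], "m4": ["m5"], "m5": []}
# extract_raw_timelines(succ, "m1", "m2")
# -> [['m1', 'm2', 'm3'], ['m1', 'm2', 'm4', 'm5']]
```

Each returned path is one raw timeline τ; with the paper's setting n = 1, only one of them would be sampled and passed on to refinement.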
Datasets:                              Multi-session Chat (MSC)             Conversation Chronicles (CC)
Methods / Metrics                      Bleu-4  Rouge-L  Mauve  BertScore    Bleu-4  Rouge-L  Mauve  BertScore
All Dialogue History                   1.65    14.89     9.06  86.28        4.90    21.56    26.47  88.13
All Memories & Current Context D       1.56    14.89    10.62  86.23        4.41    20.06    38.16  88.02
 + Memory Update (Bae et al., 2022)    1.55    14.77     9.28  86.20        4.34    20.34    34.84  88.03
Memory Retrieval (Xu et al., 2022a)    1.92    15.49    11.16  86.47        4.93    20.63    33.06  88.07
 + Memory Update (Bae et al., 2022)    1.67    15.30    13.71  86.39        4.46    20.19    34.28  88.02
Rsum-LLM (Wang et al., 2023)           0.75    11.53     2.45  84.91        0.98    11.42     2.28  85.59
MemoChat (Lu et al., 2023)             1.42    13.51     7.72  85.96        2.31    15.87    15.12  87.08
COMEDY (Chen et al., 2024b)            1.06    12.79     7.27  85.29        1.70    13.57     1.95  85.90
THEANINE (Ours)                        1.80    15.37    18.62  86.70        6.85    22.68    64.41  88.58

Table 1: Automatic evaluation of response quality (average of sessions).

Soon after MSC, DuLeMon (Xu et al., 2022b) and CareCall (Bae et al., 2022) are proposed for long-term conversations in Mandarin and Korean. Recently, Jang et al. (2023) release a new dataset, Conversation Chronicles (CC). Unlike MSC, CC augments speakers with defined relationships, such as "employee and boss". Apart from these open-domain datasets, the Psychological QA dataset (https://www.xinli001.com/) addresses long-term conversations under clinical scenarios in Mandarin. We opt for MSC and CC for evaluation to focus on English conversations, leaving multilingual and domain-specific conversations (e.g., DuLeMon, CareCall, and Psychological QA) to future work.

3.2 Baselines

To evaluate THEANINE, in addition to naive baselines that utilize all past dialogues or memories, we incorporate the following settings:

Memory Retrieval. Following Xu et al. (2022a), we use a retriever to retrieve memories relevant to the current dialogue context to augment RG.

Memory Update. We utilize LLMs to implement a widely used updating algorithm proposed by Bae et al. (2022) at the end of each dialogue session. This algorithm includes functionalities such as Change, Replace, Delete, Append, and more (see Appendix H).

RSum-LLM. An LLM-only generative method that recursively summarizes and updates the memory pool, generating responses without a retrieval module (Wang et al., 2023).

MemoChat. Proposed by Lu et al. (2023), it leverages LLMs' CoT reasoning ability to (i) conclude important memories from past conversations in a structured topic-summary-dialogue manner, (ii) select memories, and (iii) generate responses.

COMEDY. Proposed by Chen et al. (2024b), it uses LLMs to summarize session-level memories, then compresses all of them into short events, user portraits (behavioral patterns, emotion, etc.), and user-bot relation. It then selects compressed memories to augment response generation.

3.3 Models and Implementation Details

Large language models. In all experiments, including baselines, we adopt gpt-3.5-turbo-0125 (OpenAI, 2023) for (i) memory summarization (Table 6), (ii) memory update, and (iii) response generation. Temperature is set to 0.75.

Retrievers. We use text-embedding-3-small (OpenAI, 2024b) to calculate text similarity for settings involving retrievers. In the identification of top-j associative memories (Phase I-1) and top-k memory retrieval (Phase II), we set j and k to 3. For the "Memory Retrieval" baseline, we set k = 6 following Xu et al. (2022a).

Dialogue sessions. We use sessions 3-5 of MSC and CC for evaluations, as all methods are almost identical in sessions 1-2 (no memory to update).

4 Evaluation Scheme 1: Automatic and Human Evaluations

To evaluate THEANINE's responses in long-term conversations, we follow common practices and conduct 3 types of evaluations: (i) automatic evaluations; (ii) G-Eval (Liu et al., 2023), an LLM-based framework commonly used to evaluate LMs' generation; (iii) human evaluation. We now present several key findings (details, prompts, and interfaces of evaluations in Scheme 1 are in Appendix E):

(Finding 1) THEANINE outperforms baselines in response generation. Table 1 presents the agent performance in RG regarding both overlap-based and embedding-based metrics: Bleu-4 (Papineni et al., 2002), Rouge-L (Lin, 2004), Mauve (Pillutla et al., 2021), and BertScore (Zhang et al., 2020).
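The retriever configuration from Section 3.3 (cosine similarity over text-embedding-3-small vectors, with j = k = 3) reduces to a small ranking routine once embeddings are computed. A sketch under those assumptions; the commented-out OpenAI call shows one way to obtain the vectors and is not taken from the paper's code.

```python
import math

def top_k_memories(query_vec, memory_vecs, k=3):
    """Return the ids of the k stored memories most cosine-similar to the query.

    query_vec: embedding of the current dialogue context D.
    memory_vecs: dict mapping memory id -> embedding vector.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    ranked = sorted(memory_vecs,
                    key=lambda mid: cosine(query_vec, memory_vecs[mid]),
                    reverse=True)
    return ranked[:k]

# Obtaining embeddings (assumed usage of the OpenAI client, not from the paper):
# from openai import OpenAI
# vec = OpenAI().embeddings.create(model="text-embedding-3-small",
#                                  input=["<dialogue context>"]).data[0].embedding
```

The same routine serves both Phase I-1 (top-j associative memories against a new memory) and the Phase II preparation step (top-k memories against the dialogue context); only the query changes.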
Settings / Metrics                B-4   R-L    Mauve  Bert
THEANINE (Ours)                   4.32  19.03  41.52  87.64
 w/o Relation-aware Linking       4.07  18.58  39.69  87.57
 w/o Timeline Refinement          4.03  18.82  41.34  87.66
 Broken Down, Shuffled Timeline   4.15  18.70  38.49  87.61
Memory Retrieval                  3.43  18.06  22.11  87.27

Table 2: Performance of our ablations (avg. of datasets).

[Figure 5: Human- (right) and machine-based (left) head-to-head comparisons between ours and baselines regarding the helpfulness of retrieved memories (legend: THEANINE wins / Tie / Baseline wins).]

Across both datasets, THEANINE achieves superior response quality to various baselines. Although, compared to Memory Retrieval, THEANINE scores slightly lower in overlap-based metrics (i.e., B-4 and R-L) in MSC, it largely outperforms Memory Retrieval in embedding-based metrics. Interestingly, including ours, methods without memory update generally yield higher scores, justifying our proposal towards an update- and removal-free memory management for lifelong dialogue agents.

(Finding 2 & 3) All phases contribute to performance; retrieving the timeline as a whole brings large improvement over conventional retrieval. To gain deeper insights into our design, we investigate the impact of removing THEANINE's relation-awareness during memory linking (Phase I-2) and Timeline Refinement (Phase II-2). Also, to objectively assess whether THEANINE's retrieval (i.e., retrieving the timeline as a whole) improves retrieval quality, we include a setting where retrieved timelines are broken down into randomly ordered events such that retrieved memories during RG are in the same format as conventional top-k retrieval.

In Table 2, we observe a ranking in terms of contribution to performance: relation-aware linking > retrieving the timeline as a whole > timeline refinement. This observation confirms the efficacy of constructing a memory graph with causal relations. Moreover, utilizing this graph structure to collect timelines of relevant events yields higher RG quality than conventional retrieval, despite the smaller k (3 vs. 6) in initial retrieval. Refining timelines shows smaller performance gains, suggesting room for improvement in applying them for RG. We leave it to future work.

(Finding 4) Humans and G-Eval reveal that THEANINE leads to higher retrieval quality regarding both helpfulness and accuracy. Beyond agent responses, we further investigate how different memory construction methods affect the quality of memory retrieval. Given the same current dialogues as queries for memory retrieval, Figure 5 shows head-to-head comparisons (ours vs. baselines) regarding whose retrieved memories more effectively benefit RG. We observe higher win rates for THEANINE in all comparisons, especially in human evaluations. This suggests that our method can facilitate more helpful memory augmentation for response generation.

In addition to helpfulness, objectively measuring retrieval accuracy is crucial. Since existing datasets of long-term conversations do not provide a golden mapping between dialogue contexts and memories (i.e., golden memories for retrieval), we identify 50 dialogue contexts (i.e., test instances) that require a past memory for RG, and manually measure the retrieval accuracy of different agents. The results shown in Table 3 indicate that THEANINE and its ablations demonstrate higher retrieval accuracy than baselines, and the ranking here aligns with Table 1 and success rates in Table 4.

Methods (Agents)      Golden Memory is Retrieved/Collected (%)
Memory Retrieval      68.00
 + Memory Update      64.00
MemoChat              56.00
COMEDY                48.00
THEANINE (Ours)       72.00

Table 3: Human evaluation of the accuracy of memory retrieval (we examine 50 test instances).

(Finding 5) Humans confirm that THEANINE yields responses better entailing past interactions. Now that the helpfulness of THEANINE's retrieved memories is validated, we proceed to investigate whether such helpful memories contribute towards reliable lifelong human-agent interaction. For that, we further ask a group of workers to specifically judge whether agent responses entail, contradict, or are neutral to the past via majority voting.
In Figure 6, THEANINE not only leads to a small number of contradictory responses (4%) but also demonstrates the largest percentage (68%, out of 100) of responses that entail past conversations, significantly outperforming baselines. We argue that this is because our timeline-based approach elicits memories better at representing past interactions between speakers, thus leading to responses more directly aligned with the past. This alignment is important for dialogue agents to maintain long-term intimacy with users (Adiwardana et al., 2020). Furthermore, the entailing and non-contradictory nature of THEANINE's responses highlights its potential for applications in specialized domains, such as personalized agents for clinical scenarios, where entailment between agent responses and users' past information (e.g., electronic health records or previous consulting sessions) is crucial for diagnostic decision-making (Tseng et al., 2024).

                      Entail  Neutral  Contradict
Memory Retrieval      24%     70%      6%
 + Memory Update      34%     64%      2%
Rsum-LLM              42%     52%      6%
MemoChat              44%     54%      2%
COMEDY                42%     50%      8%
THEANINE (ours)       68%     28%      4%

Figure 6: Human evaluations regarding to what extent the agent responses entail past conversations.

As a side note, Memory Update yields fewer contradictory responses (2%), indicating a potential trade-off between (i) removing outdated memories to prevent contradiction and (ii) preserving them to get richer information for RG (Kim et al., 2024a).

(Finding 6) Humans agree with THEANINE's intermediate procedures. As reported in Figure 7, judges largely agree (92%) that THEANINE properly assigns cause-effect relations to linked memories, which explains its contribution to performance. Also, they agree that timeline refinement successfully elicits more helpful information (100%; 100 samples in total) for RG. Examples of THEANINE's phases and RG are in Appendix G.

                                    Agree  Disagree
Appropriateness (Memory Linking)    92%    8%
Helpfulness (Timeline Refinement)   100%

Figure 7: Human evaluation of our intermediate phases.

5 Evaluation Scheme 2: TeaFarm - a Counterfactual-driven Evaluation Pipeline for Long-term Conversations

Evaluating memory-augmented agents in long-term conversations is non-trivial due to the unavailability of ground-truth mapping between current conversations and correct memories for retrieval. Although we may resort to G-Eval by feeding evaluator LLMs (e.g., GPT-4) the entire past history and prompting them to determine whether a response correctly recalls the past, the evaluation can be largely limited by the performance of the evaluator LLM itself (Kim et al., 2024b). To overcome this, along with THEANINE, we present TeaFarm, a human-free, counterfactual-driven pipeline for evaluating memory-augmented response generation in long-term conversations.

5.1 Testing Dialogue Agents' Memory via Counterfactual Questions

In TeaFarm, we proceed to "trick" dialogue agents into generating incorrect responses, and agents must correctly reference past conversations to avoid being misled by us. Specifically, we talk to the dialogue agent while acting as if a non-factual statement is true (thus counterfactual). Figure 8 presents some examples of counterfactual questions and the corresponding facts.

Facts (at this moment)                                   Counterfactual Questions
Speaker B has never been to Japan.                       A: Hey, did you have a great time in Tokyo?
Speaker A bought a new house in NYC three months ago.    B: So you are still hesitating to buy that house in NYC you've been talking about, right?
Speaker B does not own a car.                            B: Hey, do you remember when we sang karaoke in my car?

Figure 8: Examples of counterfactual questions.

In practice (Figure 11), when we want to evaluate an agent that has been interacting with the user for sessions, we first (1) collect all past conversations and summarize them session by session. Then, we (2) feed a question-generator LLM the collected summaries in chronological order such that it can capture the current stage of each discussed event, e.g., "Speaker B does not own a car", and (3) generate counterfactual questions from the perspective of both speakers (and the correct answers). After that, we (4) kick off (i.e., simulate) a new dialogue session, chat for a while, then (5) naturally ask the counterfactual question, and (6) assess the correctness of the agent's response. The overview figure, prompts, and synthesized data for TeaFarm are in Appendix C, H, and D, respectively.
Settings / Datasets             MSC    CC     Avg.
Memory Retrieval                0.16   0.19   0.18
+ Memory Update                 0.16   0.19   0.18
RSum-LLM∗                       0.04   0.08   0.06
MemoChat∗                       0.09   0.15   0.12
COMEDY∗                         0.06   0.18   0.12
THEANINE                        0.18   0.24   0.21
w/o Relation-aware Linking      0.17   0.20   0.19
w/o Timeline Refinement         0.16   0.19   0.18

Table 4: Success rates (SRs) of correctly recalling the past and not being fooled by the counterfactual questions in TeaFarm (tested with 200 questions).

[Figure 9: Cost-performance comparisons. A scatter plot of response quality (Mauve) against API cost for THEANINE, its ablations (w/o refinement, w/o relation, shuffled), and the baselines (Memory Retrieval with and without Memory Update, All Memories + D, All Dialogue, MemoChat, COMEDY, Rsum), with the Pareto frontier and Pareto-efficient methods marked.]

5.2 TeaFarm Results

In Table 4, THEANINE shows higher SR than baselines, especially in CC. Ablations perform slightly worse than the original, again proving the efficacy of relation-aware linking and timeline refinement. Surprisingly, all settings have low SRs, qualifying TeaFarm as a proper pipeline for stress-testing dialogue agents in long-term conversations.

Interestingly, baselines using retrievers (same as THEANINE) show superior performance to settings relying only on LLMs (i.e., RSum-LLM, MemoChat, and COMEDY). This, unexpectedly, supports our efforts in developing a new paradigm of memory management in the era of LLMs.¹⁰

To provide insight regarding conversation scenarios that are challenging for dialogue agents, we present case studies of how THEANINE fails in TeaFarm in Appendix G.

6 Further Analyses and Discussions

Cost efficiency. A concern of THEANINE is the API cost. Regardless, we argue that it is competitive when both performance and cost are taken into account. Figure 9 plots response quality (Mauve score) against the API cost.¹¹ We find THEANINE and all ablations not only outperform all baselines but also lie on the Pareto frontier, indicating an efficient cost-performance trade-off. This suggests THEANINE's value when performance is prioritized over API costs. Actual API costs and results based on B-4, R-L, and Bert scores are in Appendix I.

⁹ We apply GPT-4 (gpt-4) with a temperature of 0.75.
¹⁰ Memory update does not affect Memory Retrieval's performance. We believe it is because counterfactual questions are made to counter the newest stage of each event. The removal of older memories thus does not have much impact.
¹¹ Calculated based on session 5, which involves the most memories for management. We use Mauve for its stronger correlation with humans (Pillutla et al., 2021).

Time efficiency. Time efficiency can be an important consideration when deploying THEANINE to real-world scenarios having richer events. Figure 10 shows time-performance comparisons regarding both "memory construction" and "retrieval + RG", also using the Pareto frontier. Similarly, THEANINE and many of its ablations demonstrate an efficient time-performance trade-off.

[Figure 10: Time-performance comparisons. Two Pareto-frontier panels (memory construction time and retrieval + RG time against performance) covering the same methods as Figure 9.]

Additional comparison: Memory Retrieval with a dynamically-changing k. Due to THEANINE's graph-based procedures, the response generator may access different amounts of memories during RG depending on given contexts (i.e., queries used by the retriever) and when the conversation takes place (i.e., an earlier or a later session), whereas conventional methods (Xu et al., 2022a; Bae et al., 2022) often have a fixed number k of memories retrieved for RG. Therefore, to further quantify the effect of our proposed timeline-based management and augmentation, we compare THEANINE to Memory Retrieval with a dynamic k, where k dynamically changes based on the number of collected memories in THEANINE for each specific test sample. In other words, if THEANINE uses timelines to collect k memories during RG for a test instance Di, baselines will also retrieve k memories for generating a response for Di.

Methods / Metrics               Bleu-4   Rouge-L   Mauve   Bert
Memory Retrieval (dynamic k)    3.06     17.97     33.33   87.32
+ Memory Update                 2.68     17.19     28.49   87.11
THEANINE (Ours)                 4.22     19.22     45.53   87.70

Table 5: Additional comparison, where k in Memory Retrieval is dynamically modified for each test instance.

In Table 5, we can observe that when the number of memories is matched, ours outperforms both baselines despite the same amount of memories being provided. We assume this is because: (i) our graph-based retrieval helps us collect more beneficial memories than conventional retrieval; (ii) addressing the relation between events and shaping them based on dialogue contexts can facilitate richer contextual cues for RG.

Growing span of memories. Another inquiry is whether the growing span of memory will eventually hinder retrieval in THEANINE if there ever are hundreds of sessions. Although this may be a serious issue for conventional methods, we presume that it will be partially mitigated in THEANINE, as: (i) we retrieve relevant memories as a whole in the form of timelines, which serves as a safety net in scenarios where an important memory is missed out in top-k retrieval, since it may still be collected via the linked structure; (ii) we refine retrieved timelines based on the current dialogue such that they provide tailored information for RG, which acts as a second insurance against sub-optimal retrieval.

7 Related Work

Long-term conversations. Since MSC, there have been several studies on long-term conversations: Bae et al. (2022) train a classifier to update old memories in phone call scenarios. As we enter the era of LLMs, Li et al. (2024) leverage LLMs to write and update memories for RG. Apart from LLMs' power, human behaviors also foster methods in this field. For example, Zhong et al. (2024) apply humans' forgetting curve to make memories that have been discussed exist longer. Recently, Park et al. (2023) and Maharana et al. (2024) also adopt the concept of timelines. However, Park et al. (2023) focus on tagging the timestamp (e.g., "22:00") of events and do not explicitly model the connection between them, and, in Maharana et al. (2024), a timeline is a fixed, pre-defined series of events (potentially unrelated) which simply serves as a user profile for synthesizing dialogue data. By contrast, in our work, a timeline is built with relevant events, which are dynamically linked based on their causal relations and retrieved as the conversation goes on, benefiting our goal of consistent memory tracking and integration.

Memory-augmentation for personalized dialogue agents. The trend of long-term interaction with autonomous agents promotes their adaptation for personalized needs (Chen et al., 2024a,c). As a pioneer, Xu et al. (2022b) train a persona extractor to create user-based memories. However, training personalized agents for long-term use can be non-trivial due to the lack of data (Tseng et al., 2024). As a solution, Kim et al. (2024a) apply commonsense models and LLMs to augment existing long-term data with high-quality persona sentences; Chen et al. (2024b) present a training-free LLM-based framework that extracts user behaviors from conversations for personalized RG. Upon the success of LLMs, THEANINE leverages them to build memory timelines. These timelines represent the development of interactions and lead to responses that better entail speaker information, establishing THEANINE's potential for personalized agents.

8 Conclusions

This paper presents the first-ever timeline-based memory management and augmentation framework, THEANINE, for autonomous agents in long-term conversations. Applying THEANINE, we develop a dialogue agent that efficiently addresses the constant, lifelong tracking of memories and their integration for response generation throughout its service life. Comprehensive evaluations show that THEANINE can facilitate more beneficial memory augmentation, leading to responses that are closer to ground truths and more aligned with speakers' past interactions. THEANINE's effectiveness is further confirmed in TeaFarm, a counterfactual-driven pipeline we design to address the limitation of G-Eval and human efforts in assessing memory augmentation. We expect our novel approaches to serve as a new foundation for future efforts towards lifelong dialogue agents.
Limitations

First, the number of dialogue sessions in this study is limited to five due to the lack of longer open-domain English datasets. As we mentioned in Section 6, we presume that THEANINE's effectiveness can still hold true to some degree in longer conversations. Yet, we do acknowledge the need to apply additional modules that directly address the growing span of dialogue history/memories, such as introducing the summarize-then-compress paradigm in COMEDY (Chen et al., 2024b) to compress session-level summaries into a combined short user/event description.

Second, although we include many recent frameworks as baselines, we could not compare THEANINE with MemoryBank (Zhong et al., 2024), a framework inspired by Ebbinghaus's forgetting curve. This is because the time intervals between sessions in MSC and CC are either mostly measured in hours or not clearly specified (e.g., "a few months later"), whereas MemoryBank requires precise time intervals in days to apply the forgetting curve. Also, the data used for MemoryBank focuses on Chinese clinical scenarios, making it not feasible for our study. However, we remain positive about applying such a mechanism to improve THEANINE in our ongoing research.

Lastly, API-based LLMs may introduce risks such as privacy issues. A possible solution is to apply THEANINE to small open-source LMs for secure, local usage. While there exist challenges in data collection, one may achieve this by (i) collecting synthesized conversations with GPT-generated user profiles, (ii) running THEANINE on these data, and (iii) using the outputs of each phase to train student LMs (i.e., distillation from teacher LLMs).

Ethical Statements

LLMs might generate harmful, biased, offensive, and sexual content. We keep such content from appearing in this paper. We guarantee fair compensation for human evaluators from Amazon Mechanical Turk, ensuring an effective pay rate higher than $20 per hour based on the estimated time required to complete the tasks.

Acknowledgments

This work was mainly supported by STEAM R&D Project, NRF, Korea (RS-2024-00454458) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project), and was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00333484; RS-2024-00414981). Jinyoung Yeo is the corresponding author ([email protected]).

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Sanghwan Bae, Donghyun Kwak, Soyoung Kang, Min Young Lee, Sungdong Kim, Yuin Jeong, Hyeri Kim, Sang-Woo Lee, Woomyoung Park, and Nako Sung. 2022. Keep me updated! Memory management in long-term conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3769–3787.

Frederic Charles Bartlett. 1995. Remembering: A study in experimental and social psychology. Cambridge University Press.

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. 2024. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232.

Hyungjoo Chae, Yongho Song, Kai Ong, Taeyoon Kwon, Minjin Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, and Jinyoung Yeo. 2023. Dialogue chain-of-thought distillation for commonsense-aware conversational agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5606–5632.

Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. 2023. Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029.

Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, et al. 2024a. From persona to personalization: A survey on role-playing language agents. arXiv preprint arXiv:2404.18231.

Nuo Chen, Hongguang Li, Juhua Huang, Baoyuan Wang, and Jia Li. 2024b. Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations. arXiv preprint arXiv:2402.11975.

Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, and Yuji Matsumoto. 2024c. Recent trends in personalized dialogue generation: A review of datasets, methodologies, and evaluations. arXiv preprint arXiv:2405.17974.

Jena D Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. (Comet-) Atomic 2020: On symbolic and neural commonsense knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6384–6392.

Bahar Irfan, Mariacarla Staffa, Andreea Bobu, and Nikhil Churamani. 2024. Lifelong learning and personalization in long-term human-robot interaction (LEAP-HRI): Open-world learning. In Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, pages 1323–1325.

Jihyoung Jang, Minseong Boo, and Hyounghun Kim. 2023. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13584–13606, Singapore. Association for Computational Linguistics.

Hana Kim, Kai Ong, Seoyeon Kim, Dongha Lee, and Jinyoung Yeo. 2024a. Commonsense-augmented memory construction and management in long-term conversations via context-aware persona refinement. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 104–123, St. Julian's, Malta. Association for Computational Linguistics.

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024b. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535.

Taeyoon Kwon, Kai Tzu-iunn Ong, Dongjin Kang, Seungjun Moon, Jeong Ryong Lee, Dosik Hwang, Beomseok Sohn, Yongsik Sim, Dongha Lee, and Jinyoung Yeo. 2024. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18417–18425.

Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2024. Hello again! LLM-powered personalized agent for long-term dialogue. arXiv preprint arXiv:2406.05925.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522.

Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. 2023. MemoChat: Tuning LLMs to use memos for consistent long-range open-domain conversation. arXiv preprint arXiv:2308.08239.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, Bangkok, Thailand. Association for Computational Linguistics.

MetaAI. 2024. Llama3. https://ai.meta.com/blog/meta-llama-3-1/.

Seyed Mahed Mousavi, Simone Caldarella, and Giuseppe Riccardi. 2023. Response generation in longitudinal dialogues: Which knowledge representation helps? In The 5th Workshop on NLP for Conversational AI, page 1.

Bao Trong Nguyen, Naveen Sharma, Eun-Joo Shin, Ji Hoon Jeong, Sung Hoon Lee, Choon-Gon Jang, Seung-Yeol Nah, Toshitaka Nabeshima, Yukio Yoneda, and Hyoung-Chun Kim. 2019. Theanine attenuates memory impairments induced by klotho gene depletion in mice. Food & Function, 10(1):325–332.

OpenAI. 2023. ChatGPT. https://openai.com/blog/chatgpt.

OpenAI. 2024a. OpenAI website. https://openai.com/.

OpenAI. 2024b. OpenAI's text embeddings.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. Mauve: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems, 34:4816–4828.

Chongyang Tao, Jiazhan Feng, Tao Shen, Chang Liu, Juntao Li, Xiubo Geng, and Daxin Jiang. 2023. CORE: Cooperative training of retriever-reranker for effective dialogue response selection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3102–3114.

Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Yu-Ching Hsu, Jia-Yin Foo, Chao-Wei Huang, and Yun-Nung Chen. 2024. Two tales of persona in LLMs: A survey of role-playing and personalization. arXiv preprint arXiv:2406.01171.

Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. 2023. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022.

Jing Xu, Arthur Szlam, and Jason Weston. 2022a. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5180–5197, Dublin, Ireland. Association for Computational Linguistics.

Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. 2022b. Long time no see! Open-domain conversation with long-term persona memory. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2639–2650.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731.

A Appendix Contents

• Appendix B.1: Cause-effect Commonsense Relations Adopted.
• Appendix B.2: Algorithms for THEANINE.
• Appendix B.3: Implementation Details on Computational Experiments.
• Appendix C: TeaFarm Evaluation.
• Appendix D: The TeaBag Dataset.
• Appendix E: Details on Evaluation Scheme 1 (G-Eval and Human Evaluations).
• Appendix F: Session-specific Results of Automatic Evaluation.
• Appendix G: Empirical Examples.
• Appendix H: Prompts Used in This Work.
• Appendix I: Further Analyses.
• Appendix J: Terms for Use of Artifacts.

B Further Implementation Details

B.1 Cause-effect Commonsense Relations

We adopt and modify commonsense relations from Hwang et al. (2021) for our relation-aware memory linking. Below is the list of our commonsense relations R:

Changed: Events in A changed to events in B.
Cause: Events in A caused events in B.
Reason: Events in A are due to events in B.
HinderedBy: When events in B can be hindered by events in A, and vice versa.
React: When, as a result of events in A, the subject feels as mentioned in B.
Want: When, as a result of events in A, the subject wants events in B to happen.
SameTopic: When the specific topic addressed in A is also discussed in B.

Limited by the performance of retrievers, it is possible that an m ∈ Ma does not have a relation, other than just textual overlap, with mnew. We address this by allowing the LLM to output None.

B.2 Algorithms for THEANINE

The pseudo algorithms for Phase I and II are provided in Algorithm 1 and 2.
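As a minimal illustration (not the paper's implementation), the relation set in B.1 and the None fallback can be written as a lookup plus a guard; the LLM classifier itself is assumed and represented only by the label it returns.

```python
from typing import Optional

# Relation names and glosses follow the list in Appendix B.1.
RELATIONS = {
    "Changed": "Events in A changed to events in B.",
    "Cause": "Events in A caused events in B.",
    "Reason": "Events in A are due to events in B.",
    "HinderedBy": "Events in B can be hindered by events in A, and vice versa.",
    "React": "As a result of events in A, the subject feels as mentioned in B.",
    "Want": "As a result of events in A, the subject wants events in B to happen.",
    "SameTopic": "The specific topic addressed in A is also discussed in B.",
}

def parse_relation(llm_label: str) -> Optional[str]:
    # The LLM may return any relation name in R, or "None" when the two
    # memories merely overlap textually; anything unrecognized is treated
    # as no relation, so no edge is added to the memory graph.
    return llm_label if llm_label in RELATIONS else None
```

The guard keeps malformed LLM outputs from introducing unlabeled edges into the graph.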
B.3 Implementation Details on Computational Experiments

All computational experiments in this work are based on the OpenAI API (OpenAI, 2024a). Thus, no computing infrastructure is required in this work.

C TeaFarm Evaluation

The overall pipeline of TeaFarm is illustrated in Figure 11.

D The TeaBag Dataset

As a byproduct of TeaFarm, we curate TeaBag, a dataset for TeaFarm evaluation on MSC and CC. TeaBag consists of:

• 100 episodes of original conversations from Multi-Session Chat and Conversation Chronicles (sessions 1-5; 50 episodes from each dataset).

• Two pairs of counterfactual QAs for each episode (200 pairs in total).

• Two synthesized follow-up conversations (i.e., session 6) for each episode (thus 200 in total), each of which naturally guides the conversation from session 5 towards one of the counterfactual questions.

This dataset is made with GPT-4. The prompt for generation is in Appendix H. We expect future work to apply TeaBag to stress-test whether their dialogue systems can correctly reference past conversations. TeaBag does not contain personally identifying information, as it is generated based on datasets where all contents are pure artificial creation rather than contents collected from the real world. Also, we have tried our best to confirm that this dataset does not contain any offensive content. For the overview of data collection, please refer to steps 1-4 of TeaFarm (Figure 11).

E Details on Evaluation Scheme 1

We perform evaluations using sessions 3-5 from MSC and CC, as all settings are almost identical before the end of session 2, due to the fact that there is no memory to update before then.

The test sets of MSC and CC contain over 500 and 20,000 episodes of conversations, where each episode has 5 dialogue sessions, yielding 1.2M turns of responses in total. Due to the limited budget for generation (both baselines and ours), when not specified, we sample 50 episodes from each dataset for experiments in this paper (around 3.6K conversational turns in total).

E.1 G-Eval

G-Eval (Liu et al., 2023) is a framework using LLMs with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of models' text generation. G-Eval with GPT-4 has been shown to generate evaluation results that highly align with human judgement (Liu et al., 2023; Kim et al., 2024b) and thus has been widely applied in many LM-based projects. We conduct G-Eval on 5 episodes. The prompt for evaluating the helpfulness of retrieved memories is in Figure 26. We use SciPy¹² to calculate p-values.

E.2 Human Evaluation

We conduct human evaluation with workers from Amazon Mechanical Turk (AMT). We construct the following evaluations:

• Appropriateness of relation-aware memory linking: In this evaluation, we ask the workers to judge whether they agree that the relation-aware linking is properly done for two given memories. The interface provided to AMT workers, which includes detailed instructions for human evaluation, is shown in Figure 12.

• Helpfulness of context-aware timeline refinement: This evaluation requires the workers to determine whether they agree that our context-aware refinement really tailors a raw timeline into a resource of useful information for generating the next response. The interface provided to AMT workers, which includes detailed instructions for human evaluation, is shown in Figure 13.

• The quality of responses: Here, the workers are asked to judge whether the responses correctly refer to past conversations. After reading our responses and past memories, they choose whether the responses entail, contradict, or are neutral to past memories. To improve evaluation quality, we use GPT-4 to select responses for this specific evaluation based on past memories, addressing the fact that not every turn in the conversation requires previous information to generate the next response (in the other two evaluations, the samples are randomly selected). The interface provided to AMT workers, which includes detailed instructions for human evaluation, is shown in Figure 14.

• The helpfulness of retrieved memories: Given the same dialogue context, human workers are asked to select which of the two retrieved memories, ours or a baseline's, is more helpful for generating the next response. The interface provided to AMT workers, which includes detailed instructions for human evaluation, is shown in Figure 15.

Each data sample is judged by 3 different workers, and we report the results based on the majority rule. In the third evaluation, when every option (entailment, neutral, contradiction) gets one vote, we consider it neutral (13 samples in total). These human evaluations are conducted on 100 conversational turns.

¹² https://scipy.org/

F Session-specific Evaluation Results

We provide session-specific results for automatic evaluations in Table 9.

G Empirical Examples

Outputs from THEANINE. We provide several empirical examples of THEANINE. Examples of relation-aware memory linking are in Figures 16, 17, and 18. Examples of utilizing refined timelines for response generation are in Figure 19.

How THEANINE fails in TeaFarm. We present failure cases where THEANINE fails to pass the TeaFarm test in Figure 20 and Figure 21. In Figure 20, although the conversation has shifted to "librarian", the similarity-based retriever retrieves unhelpful memories due to the huge portion of "kid" in the context. While a helpful memory (i.e., "A is a retired librarian") is eventually caught by our designed timeline structure, the LLM still hallucinates. We assume this is due to the noise introduced by those highly-ranked, yet irrelevant memories, and it highlights the need for addressing helpfulness ranking among retrieved memories in lifelong dialogue systems. Figure 21 shows a failure case where THEANINE successfully retrieves the correct memories but generates an improper response. We hypothesize that this is because relation-aware linking and context-aware timeline refinement may sometimes make the length of input tokens too long, such that the agent cannot properly utilize the key information provided. We believe this can be resolved to an extent via dedicated prompt (i.e., the prompt for RG) engineering. We leave this to future work.

H Prompts

The following are all prompts utilized in our study:

• Relation-aware memory linking (Phase I-2): Figure 22.

• Context-aware timeline refinement (Phase II-2): Figure 23.

• Timeline-augmented response generation (Phase III): Figure 24.

• Memory Update (baseline): Figure 25.

• RSum-LLM (baseline): We adopt the original prompt from Wang et al. (2023).

• MemoChat (baseline): We adopt the original prompt from Lu et al. (2023).

• COMEDY (baseline): We adopt the original prompt from Chen et al. (2024b).

• G-Eval: The prompt for evaluating the helpfulness of retrieved memories is in Figure 26.

• Generating counterfactual QA in TeaFarm: Figure 27.

• Generating session 6 in TeaFarm: Figure 28.

• Evaluating model responses in TeaFarm: Figure 29.

I Further Analyses

Memory summarization. At the end of each session, we use ChatGPT (gpt-3.5-turbo-0125) to summarize the conversation into memory sentences. We conduct examinations on such summarization using 100 randomly sampled sessions from MSC and CC to make sure the quality of raw memories is acceptable. The result is in Table 6.

Memories that ...            No     Can't judge   Yes
Contain faulty statements    90%    9%            1%
Miss important statements    95%    4%            1%

Table 6: Human evaluation of conversation-to-memory summarization in THEANINE.
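The majority-rule aggregation described in E.2 (three votes per sample; a three-way entailment/neutral/contradiction split counts as neutral) can be sketched as:

```python
from collections import Counter

def aggregate_votes(votes: list[str]) -> str:
    """Aggregate worker votes ('entailment' / 'neutral' / 'contradiction')
    by majority rule, as in Appendix E.2."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    # a 1-1-1 split has no majority; the paper counts such samples as neutral
    if top == 1 and len(counts) == 3:
        return "neutral"
    return label
```

With three workers per sample, the only case without a majority is the three-way tie, which is exactly the case the paper maps to neutral.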
Cost-efficiency trade-off assessed using other
metrics. In Section 6, we have presented meth-
ods having an efficient cost-performance trade-off
(i.e., are Pareto-efficient) by plotting the Mauve
score against API cost (Figure 9). We present meth-
ods that are Pareto-efficient when considering the
other three metrics used in our study, i.e., B-4, R-L,
and Bert Score, in Table 7.

Agents B-4 R-L Bert Score

All Dialogue History


All Memories
+ Update
Memory Retrieval ✓ ✓
+ Update
Rsum-LLM
MemoChat
COMEDY

THEANINE (ours) ✓ ✓
w/o Relation-aware Linking
w/o Refinement ✓
Shuffled Timeline ✓ ✓ ✓

Table 7: Methods considered Pareto-efficient when


judged based on B-4, R-L, and Bert Score reported
in Table 1. ✓ = Pareto-efficient methods.
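Which methods land on the Pareto frontier can be recomputed from (cost, score) pairs: a method is Pareto-efficient if no other method is at least as cheap and at least as good, with one of the two strictly better. Below is a sketch with hypothetical numbers, not the values behind Table 7.

```python
def pareto_efficient(points: dict[str, tuple[float, float]]) -> set[str]:
    """points maps method -> (cost, score); lower cost and higher score
    are better. Returns the set of non-dominated (Pareto-efficient) methods."""
    frontier = set()
    for name, (cost, score) in points.items():
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for other, (c, s) in points.items()
            if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier
```

For example, with methods A = (cost 1.0, score 0.5), B = (2.0, 0.4), and C = (3.0, 0.9), B is dominated by A, so only A and C are on the frontier.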

API costs. The actual API costs of all settings


(ours and baselines) are in Table 8.

Agents Cost Ratio (ours = 1) Cost (per episode; $)

All Dialogue History 0.50 0.0067


All Memories & D 0.27 0.0036
+ Update 5.71 0.0771
Memory Retrieval 0.17 0.0023
+ Update 5.63 0.0760
Rsum-LLM 0.42 0.0057
MemoChat 0.52 0.0076
COMEDY 0.61 0.0082

THEANINE (ours) 1.00 0.0135


w/o Relation-aware Linking 0.50 0.0067
w/o Refinement 0.71 0.0096
Shuffled Timeline 0.20 0.0027

Table 8: API costs for THEANINE and baselines.
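The "Cost Ratio" column in Table 8 is each method's per-episode cost divided by THEANINE's ($0.0135). A quick sketch of the arithmetic:

```python
THEANINE_COST = 0.0135  # $ per episode, from Table 8

def cost_ratio(cost_per_episode: float) -> float:
    # ratio relative to THEANINE (ours = 1.00), rounded as in the table
    return round(cost_per_episode / THEANINE_COST, 2)

# e.g., All Dialogue History: 0.0067 / 0.0135 -> 0.50, matching the table
```

The same division reproduces the other rows, e.g., Memory Retrieval at $0.0023 gives a ratio of 0.17.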

J Terms for Use of Artifacts


We adopt the MSC and CC datasets from Xu et al.
(2022a) and Jang et al. (2023), respectively. Both of
these datasets are open-sourced for academic and
non-commercial use. Our curated dataset, TeaBag,
which will be released after acceptance, is open to
academic and non-commercial use.

Algorithm 1 Memory Graph Construction (Phase I)
Require: Memory graph Gt = (V t , E t )
Require: New memories Mnew = {mnew1 , ..., mnewN }
Require: Set of relations R = {Cause, Reason, Want, ..., SameTopic}
Ensure: Memory graph Gt+1 = (V t+1 , E t+1 )
1: Υ(mi , mj ) = ri,j if mi is assigned with ri,j ∈ R with mj ; None otherwise
2: Ω(V ) = (the most recent memory m ∈ V )
3: Et+1 ← Et
4: for mnew ∈ Mnew do
5: Ma ← {mi ∈ V t | mi has top-j similarity with mnew }
6: Ma∗ ← {mi ∈ Ma | Υ(mi , mnew ) = r for r ∈ R}
7: C ← {Ci | Ci connected component of Gt s.t. V(Ci ) ∩ Ma∗ ≠ ∅}
8: Mlinked ← {Ω(V(Ci ) ∩ Ma∗ ) | Ci ∈ C}
9: Enew ← {⟨mi , Υ(mi , mnew ), mnew ⟩ | mi ∈ Mlinked }
10: Et+1 ← Et+1 + Enew
11: end for
12: V t+1 ← V t + Mnew
13: Gt+1 ← (V t+1 , E t+1 )
14: return Gt+1

Algorithm 2 Timeline Retrieval and Timeline Refinement (Phase II)


Require: Memory graph G = (V, E)
Require: Dialogue context D = {ui }ni=1
Ensure: Collection of refined timelines TΦ
1: Θ(V ) = (the oldest memory m ∈ V )
2: Mre ← {mi ∈ V | mi has top-k similarity with D}
3: Cre ← {Cre | Cre connected component of G s.t. V(Cre ) ∩ Mre ≠ ∅}
4: T ← {}
5: for Cre ∈ Cre do
6: mstart ← Θ(V(Cre ))
7: T = {τ ⊂ Cre | τ is a directed linear graph s.t. mstart , mre ∈ τ ∧ deg + (τ [−1]) = 0}
8: T ← T + RandomSelection(T )
9: end for
10: TΦ ← {argmax PLLM (TΦ |D, τ ) | τ ∈ T}

11: return TΦ
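The retrieval half of Algorithm 2 can be sketched similarly. This simplified walk takes one arbitrary outgoing edge at each step (standing in for the random selection among candidate paths), assumes the graph is acyclic (edges point from older to newer memories), and does not enforce that the retrieved memory itself lies on the chosen path as line 7 does; the LLM refinement step (line 10) is omitted.

```python
from collections import defaultdict

def retrieve_timelines(nodes, edges, query_score, top_k):
    """Sketch of Phase II timeline retrieval. query_score(m) stands in for
    the retriever's similarity between memory m and the dialogue context D."""
    out = defaultdict(list)   # directed adjacency (head -> tails)
    und = defaultdict(set)    # undirected view, for connected components
    for head, _, tail in edges:
        out[head].append(tail)
        und[head].add(tail)
        und[tail].add(head)

    retrieved = sorted(nodes, key=query_score, reverse=True)[:top_k]  # M_re

    def component(seed):
        comp, stack = set(), [seed]
        while stack:
            cur = stack.pop()
            if cur not in comp:
                comp.add(cur)
                stack.extend(und[cur] - comp)
        return comp

    timelines, covered = [], set()
    for m_re in retrieved:
        comp = component(m_re)
        if frozenset(comp) in covered:      # one timeline per component
            continue
        covered.add(frozenset(comp))
        start = min(comp, key=nodes.index)  # Θ: oldest memory in the component
        path = [start]
        while out[path[-1]]:                # walk until a sink (out-degree 0)
            path.append(out[path[-1]][0])   # take one branch arbitrarily
        timelines.append(path)
    return timelines
```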

[Figure: six-stage pipeline. (1) Conducting long-term conversations (dialogue sessions 1-5); (2) Collecting summaries in chronological order; (3) Generating counterfactual QAs with GPT-4; (4) Synthesizing dialogue session 6 (session 5 plus the counterfactual question); (5) Asking the counterfactual question; (6) Measuring answer correctness with GPT-4.]
Figure 11: The overview of TeaFarm Evaluation.

Datasets: Multi-session Chat (MSC) & Conversation Chronicles (CC)


Session: Session 3 Session 4 Session 5
Methods / Metrics B-4 R-L Mauve Bert B-4 R-L Mauve Bert B-4 R-L Mauve Bert
All Dialogue History 3.13 18.04 17.34 87.17 3.17 17.96 18.54 87.12 3.53 18.69 17.42 87.31
All Memories & Current Context D 2.69 17.29 28.30 87.10 3.10 17.38 22.52 87.06 3.16 17.75 22.35 87.21
+ Memory Update (Bae et al., 2022) 2.80 17.51 22.92 87.11 2.88 17.24 21.22 86.99 3.16 17.90 22.04 87.24
Memory Retrieval (Xu et al., 2022a) 3.44 18.33 24.68 87.30 3.38 17.55 21.95 87.17 3.46 18.31 19.70 87.33
+ Memory Update (Bae et al., 2022) 3.10 18.08 25.02 87.24 2.99 17.37 25.97 87.10 3.11 17.78 20.99 87.28
Rsum-LLM∗ (Wang et al., 2023) 0.83 11.30 2.45 85.25 0.87 11.35 2.32 82.20 0.90 11.78 2.33 85.30
MemoChat∗ (Lu et al., 2023) 1.88 14.83 14.56 86.57 1.81 14.27 10.57 86.43 1.91 14.96 9.13 86.56
COMEDY∗ (Chen et al., 2024b) 1.14 12.80 4.74 85.53 1.57 13.18 5.16 85.56 1.42 13.56 3.94 85.69
THEANINE (Ours) 4.21 19.21 45.53 87.70 4.42 18.63 37.84 87.52 4.34 19.23 41.18 87.70

Table 9: Session-specific results of agent performance in response generation.

We are surveying the quality of the relation between sentences A and B.

Specifically, you will be given two sentences, A and B, along with a relation between them. You will be asked to determine whether the relation between the two sentences is properly linked. In other words, the evaluation criterion is the appropriateness of the relation between the two sentences.

Relations:
1. Changed: when events in [Sentence A] changed to events in [Sentence B]
2. Causes: when events in [Sentence A] caused events in [Sentence B]
3. Reason: when events in [Sentence A] are due to events in [Sentence B]
4. HinderedBy: when events in [Sentence B] can be hindered by events in [Sentence A], and vice versa
5. React: when, as a result of events in [Sentence A], the subject feels as mentioned in [Sentence B]
6. Want: when, as a result of events in [Sentence A], the subject wants events in [Sentence B] to happen
7. SameTopic: when the specific topic addressed in [Sentence A] is also discussed in [Sentence B]
8. None: when [Sentence A] and [Sentence B] are irrelevant

Guidelines:
1. There are four choices: Definitely Disagree / Agree and Slightly Disagree / Agree.
2. Please trust your instincts, and choose a Definitely option if you feel clearly more confident in one answer than the other.

Sentence A
${sentence_a}

Relation
${relation}

Sentence B
${sentence_b}

Q1. Do you think the relation between the two sentences is properly linked?

Definitely Disagree Slightly Disagree Slightly Agree Definitely Agree

Optional feedback? (expand/collapse)

Figure 12: Interface for human evaluation regarding memory linking.

We are surveying the quality of refinement of linked sentences.

You will be given a sequence of two sentences connected by one relation, and a refined version of it. Your task is to judge whether the refinement was done appropriately, such that the refined sentence can serve as a useful information source for generating the next response based on the dialogue context.

In other words, the criterion for judgment is the appropriateness of the refinement.

Relations:
1. Changed: when events in [Sentence A] changed to events in [Sentence B]
2. Causes: when events in [Sentence A] caused events in [Sentence B]
3. Reason: when events in [Sentence A] are due to events in [Sentence B]
4. HinderedBy: when events in [Sentence B] can be hindered by events in [Sentence A], and vice versa
5. React: when, as a result of events in [Sentence A], the subject feels as mentioned in [Sentence B]
6. Want: when, as a result of events in [Sentence A], the subject wants events in [Sentence B] to happen
7. SameTopic: when the specific topic addressed in [Sentence A] is also discussed in [Sentence B]
8. None: when [Sentence A] and [Sentence B] are irrelevant

Guidelines:
1. There are four choices: Definitely Disagree / Agree and Slightly Disagree / Agree.
2. Please trust your instincts, and choose a Definitely option if you feel clearly more confident in one answer than the other.

Dialogue Context
${dialogue}

Before Refinement (See the types of relation mentioned above)


${before_refinement}

After Refinement
${after_refinement}

Q1. Do you think that the sentence after refinement is appropriately refined considering the dialogue context and its
relations?

Definitely Disagree Slightly Disagree Slightly Agree Definitely Agree

Optional feedback? (expand/collapse)

Figure 13: Interface for human evaluation regarding timeline refinement.

We are surveying the quality of responses to a given dialogue context.

Specifically, you will be given speaker information in chronological order, a dialogue context, and a response to the last utterance in
the dialogue context. You will be asked to judge the quality of the response to the last utterance.

Criteria:
1. Entail: when the response to the last utterance in the dialogue context appropriately reflects the given information.

2. Neutral: although the response does not reflect the speaker information, it does not contradict it either.

3. Contradictory: when the response to the last utterance in the dialogue context contains a statement that contradicts the most up-to-date information about that statement.

Speaker information in chronological order


${memory}

Dialogue Context
${dialogue}

Response
${response}

Q1. Based on the criteria, select the option that fits the response.

Entail Neutral Contradictory

Optional feedback? (expand/collapse)

Figure 14: Interface for human evaluation regarding referencing past conversations in responses.

Figure 15: Interface for human evaluation regarding the helpfulness of retrieved memories.

Example 1 - [Changed]
[Before Linking]
Memory 1: Classmates A was initially hesitant about following Classmates B's advice.
Memory 1’s Contextual Background:
Classmates A: Thank you for the advice, but I'm not sure if I should follow it.

Memory 2: Classmates A was initially hesitant but received positive responses after starting the blog.
Memory 2’s Contextual Background:
Classmates A: Yeah, it was scary at first, but the response has been really positive.

[After Linking]
Classmates A was initially hesitant about following Classmates B's advice - [Changed] - Classmates A was initially
hesitant but received positive responses after starting the blog

Example 2 - [Cause]
[Before Linking]
Memory 1: The Child feels it is unfair that they have to do certain chores because the Parent is too tired.
Memory 1’s Contextual Background:
Child: But Mom, it's not fair that we have to wash the dishes because you're too lazy to do it.

Memory 2: The Parent acknowledges being lazy about washing dishes and promises to contribute more to keeping
the home clean.
Memory 2’s Contextual Background:
Parent: I realized how lazy I've been lately, especially when it comes to washing the dishes.
Parent: From now on, I promise to do my fair share and contribute more to keeping our home clean and organized.

[After Linking]
The Child feels it is unfair that they have to do certain chores because the Parent is too tired - [Cause] - The Parent
acknowledges being lazy about washing dishes and promises to contribute more to keeping the home clean

Example 3 - [Reason]
[Before Linking]
Memory 1: Speaker A has multiple sons, at least one of them is in a relationship with a Spanish girlfriend.
Memory 1’s Contextual Background:
Speaker A: One of my sons just told me that he has a Spanish girlfriend now.
Speaker A: . . . I'm visiting my son that lives in Spain next month. This will give me a chance to finally meet his
girlfriend of three years now!

Memory 2: Speaker A is interested in learning Spanish and Portuguese before her trip.
Memory 2’s Contextual Background:
Speaker A: Sounds great! I'm already very excited about my trip to Spain, and now I get to visit you in Lisbon! I need
to brush up on my Spanish and also start studying Portuguese.

[After Linking]
Speaker A has multiple sons, at least one of them is in a relationship with a Spanish girlfriend - [Reason] - Speaker A is
interested in learning Spanish and Portuguese before her trip

Figure 16: Examples of Relation-aware Memory Linking - 1.

Example 4 - [HinderedBy]
[Before Linking]
Memory 1: Speaker B is currently re-reading 'Redwall' by Brian Jacques, which was a favorite book growing up.
Memory 1’s Contextual Background:
Speaker B: I'm recently re-reading Redwall by Brian Jacques! It was one of my favorites growing up. Have you ever
read it?

Memory 2: Speaker B has been busy with a new painting and has not had time to read.
Memory 2’s Contextual Background:
Speaker B: I think I would but I have been too busy with a new painting to get in some reading.

[After Linking]
Speaker B is currently re-reading 'Redwall' by Brian Jacques, which was a favorite book growing up - [HinderedBy] -
Speaker B has been busy with a new painting and has not had time to read

Example 5 - [React]
[Before Linking]
Memory 1: The Mentee hopes to inspire others to join the cause of gender equality and fighting discrimination.
Memory 1’s Contextual Background:
Mentee: I agree. We need more people advocating for gender equality and fighting against discrimination.

Memory 2: The Mentor acknowledges the Mentee’s work in advocacy for women and girls and praises their
dedication to their values.
Memory 2’s Contextual Background:
Mentor: . . . I think this is a great reflection of the work that you've done in advocating for women and girls.
Mentor: Absolutely. And I have no doubt that your dedication to these principles will serve you well in this new job.

[After Linking]
The Mentee hopes to inspire others to join the cause of gender equality and fighting discrimination - [React] - The
Mentor acknowledges the Mentee’s work in advocacy for women and girls and praises their dedication to their values

Figure 17: Examples of Relation-aware Memory Linking - 2.

Example 6 - [Want]
[Before Linking]
Memory 1: Neighbors A and B don't know each other well and want to spend more time together.
Memory 1’s Contextual Background:
Neighbors A: . . . I feel like I don't know you well enough.
Neighbors A: Well, maybe we could hang out once a week or something.

Memory 2: Neighbor A enjoys spending time in Neighbor B's cozy home and wants to hang out more often.
Memory 2’s Contextual Background:
Neighbors A: It's okay, I love spending time in your cozy home. And speaking of spending time, can we hang out more
often?

[After Linking]
Neighbors A and B don't know each other well and want to spend more time together - [Want] - Neighbor A enjoys
spending time in Neighbor B's cozy home and wants to hang out more often

Example 7 - [SameTopic]
[Before Linking]
Memory 1: Speaker A enjoys reading sci-fi and mysteries, while Speaker B prefers fantasy books.
Memory 1’s Contextual Background:
Speaker A: I prefer sci-fi but here recently I have been craving a god mystery.
Speaker B: . . . I mostly read fantasy books myself.

Memory 2: Speaker B enjoys reading the Odd Thomas and Dark Tower series and finds inspiration for their artwork
during nature walks.
Memory 2’s Contextual Background:
Speaker B: I felt that way about the Odd Thomas series. Could never wait for the next one to come out.
Speaker B: I think I may start re-reading the entire Dark Tower series. And continue to work for new works that
interest me.

[After Linking]
Speaker A enjoys reading sci-fi and mysteries, while Speaker B prefers fantasy books - [SameTopic] - Speaker B enjoys
reading the Odd Thomas and Dark Tower series and finds inspiration for their artwork during nature walks.

Figure 18: Examples of Relation-aware Memory Linking - 3.

Example 1
[Retrieved Raw Timelines]
Memory 1: Speaker B is in love with their neighbor, John, and shared it as a secret.
[React]
Memory 2: Speaker A knows about a person named John and suggests Speaker B talk to him about their feelings.
[Want]
Memory 3: Speaker A finds the situation exciting and wishes for more excitement in their life.

[Current Dialogue Context]


Speaker A: So how's everything going with you and John?

[Timeline after Context-aware Refinement]


Speaker B confessed their love for their neighbor, John, which Speaker A found out and encouraged them to talk to him.
Excited by this revelation, Speaker A wishes for more excitement in their own life.

[Response Generated Based on the Refined Timeline]


Speaker B: I finally mustered up the courage to talk to John, and it went better than I expected. He was really
understanding and open to having a conversation about our feelings. I'm glad I took your advice and had that talk.

Example 2
[Retrieved Raw Timelines]
Memory 1: The coach provides information about the benefits of bean sprouts and the importance of a balanced diet
for athletes.
[SameTopic]
Memory 2: The Athlete has incorporated bean sprouts into their diet to improve health, leading to increased energy
and faster recovery.

[Current Dialogue Context]


Coach: Definitely. Speaking of energy levels, how's your nutrition plan been going? Have you been keeping up with
eating healthy foods like those bean sprouts we talked about?

[Timeline after Context-aware Refinement]


The Coach advises the Athlete on performance and focus, highlighting the benefits of bean sprouts and a balanced diet,
leading the Athlete to include bean sprouts in their diet, resulting in improved health, increased energy, and faster
recovery.

[Response Generated Based on the Refined Timeline]


Absolutely, Coach. I've been incorporating bean sprouts into my diet regularly, and I can feel the difference in my energy
levels and recovery. It's been making a positive impact on my performance.

Example 3
[Retrieved Raw Timelines]
Memory 1: Speaker A is a lifeguard and plans to propose to his girlfriend on the beach.
[Changed]
Memory 2: Speaker A wants to propose at the movie theater where they first met by hiding the ring in a bucket of
popcorn.
[SameTopic]
Memory 3: Speaker A is planning to propose to their girlfriend with a custom-made solitaire ring on a yellow band with
little diamonds.

[Current Dialogue Context]


Speaker A: I officially put in the order and paid a deposit for the ring!

[Timeline after Context-aware Refinement]


Speaker A is a lifeguard who plans to propose to his girlfriend on the beach, but now wants to propose at the movie
theater where they first met by hiding the ring in a bucket of popcorn. He is planning to propose to their girlfriend with a
custom-made solitaire ring on a yellow band with little diamonds after officially putting in the order and paying a
deposit for the ring.

[Response Generated Based on the Refined Timeline]


That's awesome! I know you were originally thinking about proposing on the beach, but the movie theater where you
first met sounds perfect. Hiding the ring in a bucket of popcorn is such a creative and meaningful idea. She's going to
love it!

Figure 19: Examples of Timeline Refinement and Response Generation.

Current Conversation:
A: How have you been, B? And how's the grandkids?
B: We're all fine, thank you very much. The grandkids are growing up so fast! They're already into new things. How
about your family?
A: They're doing well, thank you. My kids have started taking guitar lessons. I think they're quite enjoying it.
…{omitted for clarity}…
B: I've been spending a lot of time with the grandkids and doing some gardening.
A: Well, it sounds like you've been keeping busy. I hope you can visit the library again soon. We could use a volunteer like you.
B: You're still working as a librarian, aren't you? (Counterfactual question)

Target Fact:
A is a retired librarian, who now works as a volunteer, introducing creative activities at the library.

Retrieved Timelines (we remove refinement and present the raw memories for clarity):
Color blue = memories originally retrieved by the retriever.
●​ [Speaker A is busy with work and unable to spend as much time with their kids as they would like],
[Speaker B enjoys spending time with their grandkids].
●​ [Speaker B's mother was a librarian, which seems to have shaped their view of the profession], [Speaker
A is a retired librarian], [Speaker A introduces creative activities at the library].

THEANINE's Wrong Response:

A: Yes, I am. Even though it keeps me quite occupied, I always find joy in introducing new creative activities for the kids at the library.

Figure 20: THEANINE fails to pass TeaFarm (Example 1), due to a sudden topic change.

Current Conversation:
A: I've been working on improving my stamina lately.
B: Why is that?
A: I've realized that I need to be physically fit to keep up with my studies and other activities. It's not just about the
mind, but the body as well.
B: That's really insightful of you. Any noticeable improvements yet?
A: Absolutely. I feel like I can focus better on my studies and I'm less tired during the day.
…{omitted for clarity}…
B: It's nice to see you've found a balance between hard work and relaxation.
A: Hey, remember when I once lost a race to my brother?

Target Fact:
A has beaten their brother in races due to their competitive nature.

Retrieved Timelines (we remove refinement and present the raw memories for clarity):
Color blue = memories originally retrieved by the retriever.
●​ [A is always looking to challenge themselves and improve their skills], [A is determined and hard-working,
as seen in their desire to win a race and their dedication to studying].
●​ [A is competitive and recently beat their brother in a race.], [B brings up a past event where A beat their
brother in a race, highlighting their competitive nature.], [A reminisces about a past race with their brother
and a conversation with Tyler about a personal issue.]
●​ …

THEANINE's Wrong Response:


B: Yes, I remember that! It's all part of the journey, the wins and the losses. It's great to see how far you've come
since then.

Figure 21: THEANINE fails to pass TeaFarm (Example 2), due to sub-optimal timeline utilization during RG.

Prompt for Relation-aware Memory Linking

Your task is to find the relation between [Sentence A] and [Sentence B].
Keep in mind that [Sentence A] happened before [Sentence B].
The dialogues where each of the sentence is originated from are provided to help your reasoning.

First, identify if the relation holds among the following six relations:
1. Changed: when events in [Sentence A] changed to events in [Sentence B]
2. Cause: when events in [Sentence A] caused events in [Sentence B]
3. Reason: when events in [Sentence A] are due to events in [Sentence B]
4. HinderedBy: when events in [Sentence B] can be hindered by events in [Sentence A], and vice
versa
5. React: when, as a result of events in [Sentence A], the subject feels as mentioned in [Sentence B]
6. Want: when, as a result of events in [Sentence A], the subject wants events in [Sentence B] to
happen.

Then, if the relation does not belong to any of the relations from 1 to 6, choose between the
following two options:
7. SameTopic: when the specific topic addressed in [Sentence A] is also discussed in [Sentence B]
8. None: when [Sentence A] and [Sentence B] are irrelevant

- For relations from 1 to 7, choose them only if there is clear evidence that matches the description
of the relation. Otherwise, just choose "None" without making excessive inferences beyond the
given sentence.
- Pay attention to who the subject of each sentence is.
- Do not confuse the roles of [Sentence A] and [Sentence B] when determining the relationship.

Follow the format of this example output:


<OUTPUT>
- Explanation: (your_explanation)
- Relation: (predicted_relation)

Now, read the two dialogues and find the relation between [Sentence A] and [Sentence B].

<INPUT>
[Dialogue for Sentence A]:
{dialogue1}
[Dialogue for Sentence B]:
{dialogue2}

[Sentence A]: {sentence1}


[Sentence B]: {sentence2}

<OUTPUT>

Figure 22: The prompt for the Relation-aware memory linking.
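In practice, the prompt above is filled with the two memory sentences and their source dialogues, and the model's structured answer is parsed back into a relation label. The code below is our own sketch of that fill-and-parse step (the regex and the abbreviated prompt tail are assumptions, not code from the paper):

```python
import re

# Tail of the relation-linking prompt (Figure 22); the instruction block
# above it is elided here for brevity.
PROMPT_TAIL = (
    "<INPUT>\n"
    "[Dialogue for Sentence A]:\n{dialogue1}\n"
    "[Dialogue for Sentence B]:\n{dialogue2}\n\n"
    "[Sentence A]: {sentence1}\n\n"
    "[Sentence B]: {sentence2}\n\n"
    "<OUTPUT>\n"
)

def build_input(dialogue1, dialogue2, sentence1, sentence2):
    """Fill the prompt's input slots."""
    return PROMPT_TAIL.format(dialogue1=dialogue1, dialogue2=dialogue2,
                              sentence1=sentence1, sentence2=sentence2)

def parse_relation(llm_output, default="None"):
    """Extract the predicted relation from the requested output format:
    '- Explanation: ...' followed by '- Relation: (predicted_relation)'.
    Accepts the parentheses being present or absent."""
    match = re.search(r"-\s*Relation:\s*\(?\s*([A-Za-z]+)", llm_output)
    return match.group(1) if match else default
```

Falling back to "None" on unparseable output mirrors the prompt's own instruction to avoid excessive inference.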

Prompt for Context-aware Timeline Refinement

Given Timelines, which are structured in this format: [Event A] - (relation) - [Event B] ...,
your job is to naturally transform each timeline into useful information that can help a
language model generate a proper next response for Current Dialogue.

These are the explanations of each relation type:
1. Changed: when events in [Event A] changed to events in [Event B]
2. Cause: when events in [Event A] caused events in [Event B]
3. Reason: when events in [Event A] are due to events in [Event B]
4. HinderedBy: when events in [Event B] can be hindered by events in [Event A], and vice versa
5. React: when, as a result of events in [Event A], the subject feels as mentioned in [Event B]
6. Want: when, as a result of events in [Event A], the subject wants events in [Event B] to happen
7. SameTopic: when the specific topic addressed in [Event A] is also discussed in [Event B]

If a given relation is not proper, naturally connect them without using that relation.

Current Dialogue:
{current_dialogue_context}

Timelines:
{input_path}

Your Outputs (only the transformed timelines):
Figure 23: The prompt for the context-aware timeline refinement.

Prompt for Timeline-augmented Response Generation

Generate the most plausible next response based on the current conversation. You can refer to
the memory, but you should ignore the memory if it misleads the next response. Do not put too
much information in the next response.

Your response should follow the style of the conversation.

Memory:
{refined_timelines}

Current conversation:
{current_dialogue_context}

Next Response:
{speaker_tag}:
Figure 24: The prompt for the timeline-augmented response generation.

Prompt for Memory Update (Baseline)

Compare the 'memory' and 'summary' of the two given sentences according to the following
instructions, and output which of the following relations the two sentences have.

-'PASS': When the information in 'memory' already contains the information in 'summary', that is,
it is duplicated in content.
-'CHANGE': When the information in 'memory' has been changed in 'summary'.
-'REPLACE': When 'summary' has more information than the 'memory' without missing any
details in 'memory'.
-'APPEND': When 'summary' has new information or different information compared to
'memory'.
-'DELETE': When the situation in 'memory' has been completed or solved in 'summary'.

Tips: Most of the relations are likely to be 'APPEND'. When choosing other relations, explain with
clear evidence.

Some examples are as follows.


1. Example of "PASS"
memory: "Not sick"
summary: "Doesn't have any particular health issues"
Explanation: The information of 'not being sick' in the 'memory' already sufficiently includes the
information of 'being healthy' in the 'summary'. So the 'summary' does not need to be added.
2. Example of "CHANGE"
memory: "Doesn't have any particular health issues"
summary: "Had back surgery"
Explanation: The information in 'memory' is changed from not having health issues to having a
back surgery.
3. Example of "REPLACE"
memory: "likes listening classic music"
summary: "likes classic music and goes to concerts every week"
Explanation: The 'summary' has more information than 'memory' while also containing the
information in 'memory'. So the 'memory' can be replaced by 'summary'.
4. Example of "APPEND"
memory: "Goes to the gym"
summary: "Body is sore from exercise"
Explanation: The 'summary' contains new information compared to 'memory'.
5. Example of "APPEND"
memory: "wakes up early"
summary: "likes to drink coffee in the morning"
Explanation: The 'summary' and 'memory' contains different information.
6. Example of "DELETE"
memory: "Had sore throat"
summary: "Throat is fully recovered"
Explanation: The sore throat from the 'memory' has been recovered according to the 'summary'.

Now write the relations and explanation between the following memory and summary.
memory: {memory}
summary: {summary}

Figure 25: The prompt for the memory updating mechanism in baselines (i.e., + Memory Update).
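Once the model predicts one of the five labels above, the baseline applies the corresponding operation to its memory store. The dispatch below is our own sketch of how those operations could be executed over a list of memory sentences; the baselines' actual bookkeeping may differ in detail.

```python
def apply_update(store, memory, summary, op):
    """Execute one predicted update operation (PASS / CHANGE / REPLACE /
    APPEND / DELETE) against a list of memory sentences."""
    store = list(store)  # work on a copy
    if op in ("CHANGE", "REPLACE"):
        store[store.index(memory)] = summary   # overwrite the old memory
    elif op == "APPEND":
        store.append(summary)                  # add new/different information
    elif op == "DELETE":
        store.remove(memory)                   # situation completed or solved
    # "PASS": summary is redundant, keep the store unchanged
    return store
```

Note that DELETE and CHANGE discard the older memory entirely; this is exactly the information loss that THEANINE's timeline-based design avoids.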

Prompt for G-Eval: Helpfulness of Retrieved Memories

Your task is to choose a more helpful MEMORY based on the below criterion.

CRITERION:
Helpfulness - A more helpful MEMORY should contain speaker information that is related to CURRENT DIALOGUE CONTEXT, enabling the {speaker} to respond in an appropriate context to the last utterance of the CURRENT DIALOGUE CONTEXT.

The output format should be as follows:
Explanation: (a brief explanation)
Choice: (answer with "1", "2", or "tie")

Now choose the MEMORY that has better Helpfulness.

CURRENT DIALOGUE CONTEXT:
{current_dialogue_context}

MEMORY 1:
{memory1}

MEMORY 2:
{memory2}

YOUR OUTPUT:
Figure 26: The prompt for the G-Eval: Helpfulness of Retrieved Memories.

Prompt for Generating counterfactual QA in TeaFarm

The summaries below are summarized from conversations between two speakers throughout
multiple encounters and are listed in chronological order.

First, read these summaries and capture the development of facts about the speakers.
Then, pretend that you are one of the speakers and want to test whether a chatbot trained to
represent the other speaker can correctly remember past conversations.
You do so by asking counterfactual questions, i.e., tricky questions made with non-factual
statements.

Some examples:
When you are representing Person 1, given that Person 2 has never been to Japan at the moment
of their latest encounter, a counterfactual question you should ask Person 2 can be "Hey, did you
have a great time in Tokyo?".
When you are representing Person 2, given that Person 1 once mentioned that they bought a new house in NYC three months ago, a counterfactual question you should ask Person 1 can be "So you are still hesitating to buy that house in NYC you've been talking about. Right?".

Now, generate two counterfactual questions, one from the perspective of {speaker1} and one from
{speaker2}, based on the summaries, and also generate correct answers with which a chatbot that
perfectly remembers past conversations should answer.
Also, please insert the speaker tags ("{speaker1}:" and "{speaker2}:") and avoid them in the
questions/answers themselves.

[Summaries from conversations listed in chronological order]


{summaries}

[Question 1]
{speaker1}:

Figure 27: The prompt for generating counterfactual QA in TeaFarm.

Prompt for Generating session 6 in TeaFarm

You will be given a [Past session dialogue] of two individuals.


Create a current conversation consisting of 10-15 utterances that might occur after some time has
passed from the [Past session dialogue].
Your conversation should end with the given [Last utterance].
Do not confuse the speaker of the [Last utterance].

[Last utterance]
{Question}

[Past session dialogue]


{session5}

Now, create your conversation, ending with the [Last utterance].


[Current session dialogue]

Figure 28: The prompt for generating session 6 in TeaFarm.
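The two prompts above (Figures 27 and 28) chain into a single data-construction step: generate a counterfactual question and gold answer from the chronological summaries, then synthesize a sixth session that ends with that question. The sketch below shows this flow; the `llm` callable and the compressed prompt templates are stand-ins for the GPT-4 calls and the full prompts, not the paper's code.

```python
def teafarm_build_probe(summaries, session5, qa_prompt, session_prompt, llm):
    """Build one TeaFarm probe: a counterfactual question and a synthesized
    session 6 that naturally leads up to it."""
    # Step 1 (Fig. 27): counterfactual QA from chronological summaries.
    qa_text = llm(qa_prompt.format(summaries="\n".join(summaries)))
    question = qa_text.splitlines()[0].strip()   # first line = [Question 1]
    # Step 2 (Fig. 28): synthesize session 6 ending with that question.
    session6 = llm(session_prompt.format(Question=question, session5=session5))
    return question, session6
```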

Prompt for Evaluating model responses in TeaFarm

Below is a question, a correct answer, and an answer generated by a chatbot ("[Chatbot's Answer]").

[Question]
{query}
[Answer]
{answer}
[Chatbot's Answer]
{response}

Evaluate whether the chatbot answers the question correctly.


If the chatbot's answer is contradictory to the given answer, it is "Incorrect".
If the chatbot's answer aligns with the given answer, it is "Correct".
Use the following format:
[Evaluation]
Evaluation: <your analysis>
Result: {"Correct" or "Incorrect"}

-Your Task-
[Evaluation]

Figure 29: The prompt for evaluating model response in TeaFarm.
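The evaluator's output ends with a structured verdict line, which can be reduced to a boolean for computing TeaFarm accuracy. The regex-based parsing here is our assumption about post-processing, not code from the paper:

```python
import re

def judge_correct(evaluator_output):
    """Parse the verdict from the TeaFarm evaluation output, which ends with
    'Result: "Correct"' or 'Result: "Incorrect"'. Missing or malformed
    verdicts are conservatively treated as incorrect."""
    match = re.search(r'Result:\s*"?(Correct|Incorrect)"?', evaluator_output)
    return bool(match) and match.group(1) == "Correct"
```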

