2025.4 (NAACL) - Towards Lifelong Dialogue Agents Via Timeline-Based Memory Management
[Figure 1: (a) memory management; (b) entire dialogue history as model input; (c) timeline-augmented response generation (ours).]

...focus on getting rid of outdated memories to improve retrieval quality, we argue that such memories provide rich, important contextual cues for RG (e.g., changes in user behaviors) in long-[...]ing the evolution or causality of relevant past events. Along with THEANINE, we introduce TeaFarm, a counterfactual-driven evaluation scheme, addressing the limitation of G-Eval and human efforts when assessing agent performance in integrating past memories into RG. [...]
Figure 2: The overview of THEANINE. Left: Linking new memories to the memory graph after finishing a dialogue session; Right: Memory timeline retrieval, refinement, and response generation in a new dialogue session.
...tion to prevent such information loss,[1] this often leads to biased attention toward the latest user input (Figure 1 (b)), ignoring relevant contexts from the past (Liu et al., 2024). These findings highlight two main challenges towards lifelong dialogue agents: (i) Memory construction: how to store large-scale past interactions effectively without removing old memories? (ii) Response generation: within the growing memory span, how to identify relevant contextual cues for generating proper responses?

Motivated by these, we propose addressing the above two challenges separately yet complementarily, by (i) discarding memory update to avoid information loss, and preserving relevant memories on the timeline in a linked structure; and (ii) retrieving the timeline as a whole to better catch relevant memories within the growing search span.

We present THEANINE,[2] a framework for facilitating lifelong dialogue agents. Starting from memory construction (Phase I), instead of stacking raw memory sentences as-is (Xu et al., 2022a), which may affect memory retrieval and also response quality due to the unstructured format of information (Mousavi et al., 2023; Chen et al., 2023), THEANINE stores memories in a directed graph. In this graph, inspired by how humans naturally link new memories to existing ones of relevant events based on their relation (Bartlett, 1995), memories are linked using their temporal and cause-effect commonsense relations (Hwang et al., 2021). Supported by such linking structure, in memory retrieval for RG (Phase II-1), we go beyond conventional top-k retrieval and further obtain the complete timelines to avoid missing out on important memories that have low textual overlap with the current conversation (Tao et al., 2023). Lastly, to tackle the discrepancy between off-line memory construction and online deployment, THEANINE uses an LLM to refine retrieved timelines (Phase II-2) based on the current conversation, such that they provide tailored information (Chae et al., 2023) for RG (Phase III). Our contributions are two-fold:

• To achieve lifelong dialogue agents, we present THEANINE, an LLM-based framework with a relation-aware memory graph and timeline augmentation for long-term conversations. THEANINE outperforms representative baselines across automatic, LLM-based, and human evaluations of RG. Also, we confirm that THEANINE leads to higher retrieval quality, and its procedures align with human preference. To our knowledge, we are the first to model the use of timelines (i.e., linked relevant memories) in memory management and response generation.

• The lack of golden mapping between conversations and reference memories poses a challenge in assessing memory-augmented agents. We present TeaFarm, a counterfactual-driven pipeline evaluating agent performance in referencing the past without human intervention.

[1] For instance, GPT-4o and Llama 3.1 have context windows of 128K tokens (OpenAI, 2024a; MetaAI, 2024).
[2] L-theanine is an amino acid found in green tea that has been linked to memory improvement (Nguyen et al., 2019).
2 Methodologies

We present THEANINE, a framework for lifelong dialogue agents inspired by how humans store and retrieve memories for conversations (Figure 2):

2.1 Memory Graph Construction (Phase I)

To manage large-scale memories and facilitate structured information for RG (Mousavi et al., 2023; Chen et al., 2023), we approach memory management using a memory graph G:

G = (V, E)  (1)
V = {m1, m2, ..., m|V|}  (2)
m = (event, time)  (3)
E = {⟨mi, rij, mj⟩ | mi, mj ∈ V ∧ rij ∈ R}  (4)
R = {Cause, Reason, Want, ..., SameTopic}  (5)

In G, vertices V are memories m summarized from the conversations. Each memory m = (event, time) consists of an event[3] and the time it is formed (summarized). Each directed edge e ∈ E between two connected m indicates their temporal order and their cause-effect commonsense relation r ∈ R.

At the end of dialogue session t, THEANINE starts linking each new memory mnew summarized from session t to the memory graph Gt.

Phase I-1: Identifying associative memories for memory linking. Following how humans link new memories to existing ones that are related to a similar event/topic, i.e., the associative memories, THEANINE starts by identifying these associative memories from the memory graph Gt. Formally, given a newly-formed memory mnew waiting to be stored, the associative memories Ma of mnew are defined as the set of mi ∈ Gt having top-j similarity with mnew (Algorithm 1, line 5).

Phase I-2: Relation-aware memory linking. We adopt a relation-aware memory linking, where an edge between two memories is encoded with their cause-effect commonsense relation r ∈ R, along with the temporal order. In practice, we adopt the commonly used relations defined by Hwang et al. (2021), including HinderedBy, Cause, Want, and 4 more (Appendix B.1).

We start by determining the relation between mnew and each associative memory. Formally, for each pair of mnew and m ∈ Ma, the LLM assigns a relation r ∈ R based on their event, time and their origin conversations:

Ma* = {mi ∈ Ma | Υ(mi, mnew) ∈ R}  (6)

where Υ(·, mnew) ∈ R indicates that the given memory is assigned an r ∈ R with mnew,[4] and such assigned memories are defined as Ma*.

We then proceed to link mnew to the graph. We first locate every connected component Ci ⊂ Gt that contains at least one m ∈ Ma*, as shown in Figure 3 (a) and (b):

C = {Ci ⊂ Gt | V(Ci) ∩ Ma* ≠ ∅}  (7)

where C is the collection of those Ci and V(·) represents "vertices in". Then, we link mnew to the most recent[5] m ∈ Ma* in each Ci ⊂ C (Figure 3 (c)). The memories Mlinked that are linked to mnew are defined as follows:

Mlinked = {Ω(V(Ci) ∩ Ma*) | Ci ⊂ C}  (8)

where Ω(·) indicates "the most recent memory in".

[Figure 3: panels (a)-(c), each ordering memories from old to recent.]
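The graph defined in Eqs. (1)-(5) can be sketched as a small data model. This is a minimal illustration, not the paper's implementation: the class and method names are ours, and the relation set is the seven relations spelled out in the human-evaluation interface (Changed, Cause, Reason, HinderedBy, React, Want, SameTopic).

```python
from dataclasses import dataclass, field

# Relation set R, following Hwang et al. (2021) as listed in Appendix B.1
# and the evaluation interface of this paper.
RELATIONS = {"Changed", "Cause", "Reason", "HinderedBy", "React", "Want", "SameTopic"}

@dataclass(frozen=True)
class Memory:
    event: str  # summarized event
    time: int   # session index at which the memory was formed

@dataclass
class MemoryGraph:
    vertices: list = field(default_factory=list)  # V = {m1, ..., m|V|}
    edges: list = field(default_factory=list)     # E = {<mi, rij, mj>}

    def add_memory(self, m: Memory) -> None:
        self.vertices.append(m)

    def add_edge(self, m_i: Memory, r: str, m_j: Memory) -> None:
        # Each edge encodes temporal order (older -> newer) plus a
        # cause-effect commonsense relation r in R.
        assert r in RELATIONS and m_i.time <= m_j.time
        self.edges.append((m_i, r, m_j))
```
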
2.2 Timeline Retrieval and Timeline Refinement (Phase II)

Thanks to the constructed memory graph, THEANINE can proceed to augment RG with timelines of relevant events, addressing the information loss in conventional memory management (Figure 1). With Gt+1, THEANINE performs the following steps for RG in session t + 1:

Preparation: Top-k memory retrieval. During the conversation, using the current dialogue context D = {u_i}_{i=1}^{n} of n utterances u as query, we retrieve top-k memories Mre = {mre1, ..., mrek}.

Phase II-1: Retrieving and untangling raw memory timelines. We wish to also access memories centered around Mre. Formally, given mre ∈ Mre, we further collect the connected component Cre ⊂ Gt+1 that contains mre via the linked structure. Since this collection of memories (i.e., Cre) can be "tangled up" together (i.e., connected in a complex manner) due to the graph structure, we proceed to untangle it into several memory timelines, each representing a series of events about mre that starts out similarly yet branches into slightly different development. For that, we first locate the earliest memory in Cre as a starting point mstart for all timelines, as shown in Figure 4 (left):

mstart = Θ(V(Cre))  (9)

where Θ indicates "the oldest memory in". [...] We then sample n raw timelines τ from T.[6] Repeating[7] Phase II-1 for all retrieved top-k memories, we collect a set of retrieved raw memory timelines T (the union over all mre), where |T| = k ∗ n.

Phase II-2: Context-aware timeline refinement. Although we have constructed the memory graph using temporal and commonsense relations to improve its informativeness, directly applying retrieved timelines for RG can be suboptimal (RQ3, Section 4), because graph construction does not take the current conversation into consideration, i.e., they are constructed off-line.

In this phase, THEANINE tackles such a discrepancy between off-line memory construction and online deployment (i.e., ongoing conversation) via a context-aware timeline refinement. Motivated by how LLMs can refine their previous generation (Madaan et al., 2024), we leverage LLMs to refine raw timelines into a rich resource of information crafted for the current conversation, by removing redundant information or highlighting information that can come in handy. Formally, given the current dialogue D and retrieved raw timelines T, an LLM tailors each τ ∈ T into refined timelines TΦ:

TΦ = {argmax_{τΦ} PLLM(τΦ | D, τ) | τ ∈ T}  (11)

All refined timelines TΦ are then used to augment the response generation. We provide the pseudo algorithm for Phase II in Algorithm 2.
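Under one reading of Phase II-1, untangling Cre amounts to enumerating every directed path from mstart to a leaf of the component, so each branch becomes one raw timeline τ. The sketch below makes that concrete; the function name is ours, and it assumes the component's edges follow temporal order and therefore contain no cycles.

```python
def untangle(component_edges, m_start):
    """Enumerate raw timelines: every directed path from m_start to a leaf.

    component_edges: iterable of (src, relation, dst) triples within C_re.
    Returns a list of memory lists, each one raw timeline tau.
    Assumes the component is acyclic (edges respect temporal order).
    """
    children = {}
    for src, _rel, dst in component_edges:
        children.setdefault(src, []).append(dst)

    timelines, stack = [], [[m_start]]
    while stack:
        path = stack.pop()
        successors = children.get(path[-1], [])
        if not successors:          # reached a leaf: one complete timeline
            timelines.append(path)
        for nxt in successors:      # branch into diverging developments
            stack.append(path + [nxt])
    return timelines
```

Phase II-2 would then hand each τ, together with the dialogue context D, to the refinement prompt of Eq. (11); we omit that LLM call here.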
DuLeMon (Xu et al., 2022b) and CareCall (Bae et al., 2022) are proposed for long-term conversations in Mandarin and Korean. Recently, Jang et al. (2023) release a new dataset, Conversation Chronicles (CC). Unlike MSC, CC augments speakers with defined relationships, such as "employee and boss". Apart from these open-domain datasets, the Psychological QA[8] addresses long-term conversations under clinical scenarios in Mandarin.

We opt for MSC and CC for evaluation to focus on English conversations, leaving multilingual and domain-specific conversations (e.g., DuLeMon, CareCall, and Psychological QA) to future work.

3.2 Baselines

To evaluate THEANINE, in addition to naive baselines that utilize all past dialogues or memories, we incorporate the following settings:

Memory Retrieval. Following Xu et al. (2022a), we use a retriever to retrieve memories relevant to the current dialogue context to augment RG.

Memory Update. We utilize LLMs to implement a widely used updating algorithm proposed by Bae et al. (2022) at the end of each dialogue session. This algorithm includes functionalities such as Change, Replace, Delete, Append, and more (see Appendix H).

RSum-LLM. An LLM-only generative method that recursively summarizes and updates the memory pool, generating responses w/o a retrieval module (Wang et al., 2023).

MemoChat. Proposed by Lu et al. (2023), it leverages LLMs' CoT reasoning ability to (i) conclude important memories from past conversations in a structured topic-summary-dialogue manner, (ii) select memories, and (iii) generate responses.

COMEDY. Proposed by Chen et al. (2024b), it uses LLMs to summarize session-level memories, compresses all of them into short events, user portraits (behavioral patterns, emotion, etc.) and user-bot relation. It then selects compressed memories to augment response generation.

3.3 Models and Implementation Details

Large language models. In all experiments, including baselines, we adopt gpt-3.5-turbo-0125 (OpenAI, 2023) for (i) memory summarization (Table 6), (ii) memory update, and (iii) response generation. Temperature is set to 0.75.

Retrievers. We use text-embedding-3-small (OpenAI, 2024b) to calculate text similarity for settings involving retrievers. In the identification of top-j associative memories (Phase I-1) and top-k memory retrieval (Phase II), we set j and k to 3. For the "Memory Retrieval" baseline, we set k = 6 following Xu et al. (2022a).

Dialogue sessions. We use sessions 3-5 of MSC and CC for evaluations, as all methods are almost identical in sessions 1-2 (no memory to update).

4 Evaluation Scheme 1: Automatic and Human Evaluations

To evaluate THEANINE's responses in long-term conversations, we follow common practices and conduct 3 types of evaluations: (i) automatic evaluations; (ii) G-Eval (Liu et al., 2023), an LLM-based framework commonly used to evaluate LMs' generation; (iii) human evaluation. We now present several key findings (details, prompts, and interfaces of evaluations in Scheme 1 are in Appendix E):

(Finding 1) THEANINE outperforms baselines in response generation. Table 1 presents the agent performance in RG regarding both overlap-based and embedding-based metrics: Bleu-4 (Papineni et al., 2002), Rouge-L (Lin, 2004), Mauve (Pillutla et al., 2021), and BertScore (Zhang et al., 2020).

[8] https://www.xinli001.com/
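The retrieval setting of Section 3.3 (an embedding model scoring stored memories against the dialogue context, with j = k = 3) can be sketched as follows. To keep the example self-contained, a bag-of-words counter stands in for text-embedding-3-small; the function names are ours, and only the cosine-similarity ranking logic is the point.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for text-embedding-3-small: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(c * v[w] for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query, memories, k=3):
    # Rank stored memory sentences against the dialogue context `query`.
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]
```
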
Settings / Metrics               B-4    R-L    Mauve   Bert
THEANINE (Ours)                  4.32   19.03  41.52   87.64
Broken Down, Shuffled Timeline   4.15   18.70  38.49   87.61
Memory Retrieval                 3.43   18.06  22.11   87.27

Table 2: Performance of our ablations (avg. of datasets).

[Figure 5: Human- (right) and machine-based (left) head-to-head comparisons between ours and baselines regarding the helpfulness of retrieved memories. vs. Memory Retrieval: 52.0% / 48.0% / 48.3% / 40.4% / 11.4%; vs. COMEDY: 61.9% / 38.1% / 44.1% / 41.4% / 14.4%. Legend: THEANINE wins / Tie / Baseline wins.]
Across both datasets, THEANINE achieves superior response quality compared to various baselines. Although, compared to Memory Retrieval, THEANINE scores slightly lower in overlap-based metrics (i.e., B-4 and R-L) on MSC, it largely outperforms Memory Retrieval in embedding-based metrics. Interestingly, including ours, methods without memory update generally yield higher scores, justifying our proposal towards an update- and removal-free memory management for lifelong dialogue agents.

(Finding 2 & 3) All phases contribute to performance; retrieving the timeline as a whole brings large improvement over conventional retrieval. To gain deeper insights into our design, we investigate the impact of removing THEANINE's relation-awareness during memory linking (Phase I-2) and Timeline Refinement (Phase II-2). Also, to objectively assess whether THEANINE's retrieval (i.e., retrieving the timeline as a whole) improves retrieval quality, we include a setting where retrieved timelines are broken down into randomly ordered events such that retrieved memories during RG are in the same format as conventional top-k retrieval.

In Table 2, we observe a ranking in terms of contribution to performance: relation-aware linking > retrieving the timeline as a whole > timeline refinement. This observation confirms the efficacy of constructing a memory graph with causal relations. Moreover, utilizing this graph structure to collect timelines of relevant events yields higher RG quality than conventional retrieval, despite the smaller k (3 vs. 6) in initial retrieval. Refining timelines shows smaller performance gains, suggesting room for improvement in applying them for RG. We leave it to future work.

(Finding 4) Humans and G-Eval reveal that THEANINE leads to higher retrieval quality regarding both helpfulness and accuracy. Beyond agent responses, we further investigate how different memory construction methods affect the quality of memory retrieval. Given the same current dialogues as queries for memory retrieval, Figure 5 shows head-to-head comparisons (ours vs. baselines) regarding whose retrieved memories more effectively benefit RG. We observe higher win rates for THEANINE in all comparisons, especially in human evaluations. This suggests that our method can facilitate more helpful memory augmentation for response generation.

In addition to helpfulness, objectively measuring retrieval accuracy is crucial. Since existing datasets of long-term conversations do not provide a golden mapping between dialogue contexts and memories (i.e., golden memories for retrieval), we identify 50 dialogue contexts (i.e., test instances) that require a past memory for RG, and manually measure the retrieval accuracy of different agents. The results shown in Table 3 indicate that THEANINE and its ablations demonstrate higher retrieval accuracy than baselines, and the ranking here aligns with Table 1 and the success rates in Table 4.

Methods (Agents)      Golden Memory is Retrieved/Collected (%)
Memory Retrieval      68.00
+ Memory Update       64.00
MemoChat              56.00
COMEDY                48.00
THEANINE (Ours)       72.00

Table 3: Human evaluation of the accuracy of memory retrieval (we examine 50 test instances).

(Finding 5) Humans confirm that THEANINE yields responses better entailing past interactions. Now that the helpfulness of THEANINE's retrieved memories is validated, we proceed to investigate whether such helpful memories contribute towards reliable lifelong human-agent interaction. For that, we further ask a group of workers to specifically judge whether agent responses entail, contradict, or are neutral to the past via majority voting. In Figure 6, THEANINE not only leads to a small number of contradictory responses (4%) but also demonstrates the largest percentage (68%;
out of 100) of responses that entail past conversations, significantly outperforming baselines. We argue that it is because our timeline-based approach elicits memories better at representing past interactions between speakers, thus leading to responses more directly aligned with the past. This alignment is important for dialogue agents to maintain long-term intimacy with users (Adiwardana et al., 2020). Furthermore, the entailing and non-contradictory nature of THEANINE's responses highlights its potential for applications in specialized domains, such as personalized agents for clinical scenarios, where entailment between agent responses and users' past information (e.g., electronic health records or previous consulting sessions) is crucial for diagnostic decision-making (Tseng et al., 2024).

Settings            Entail   Neutral   Contradict
Memory Retrieval     24%      70%       6%
+ Memory Update      34%      64%       2%
Rsum-LLM             42%      52%       6%
MemoChat             44%      54%       2%
COMEDY               42%      50%       8%
THEANINE (ours)      68%      28%       4%

Figure 6: Human evaluations regarding to what extent the agent responses entail past conversations.

As a side note, Memory Update yields fewer contradictory responses (2%), indicating a potential trade-off between (i) removing outdated memories to prevent contradiction and (ii) preserving them to get richer information for RG (Kim et al., 2024a).

(Finding 6) Humans agree with THEANINE's intermediate procedures. As reported in Figure 7, judges largely agree (92%) that THEANINE properly assigns cause-effect relations to linked memories, which explains its contribution to performance. Also, they agree that timeline refinement successfully elicits more helpful information (100%; 100 samples in total) for RG. Examples of THEANINE's phases and RG are in Appendix G.

[Figure 7: Agree / Disagree. Appropriateness (Memory Linking): 92% / 8%.]

[...] conversations and correct memories for retrieval. Although we may resort to G-Eval by feeding evaluator LLMs (e.g., GPT-4) the entire past history and prompting them to determine whether a response correctly recalls the past, the evaluation can be largely limited by the performance of the evaluator LLM itself (Kim et al., 2024b).

To overcome this, along with THEANINE, we present TeaFarm, a human-free counterfactual-driven pipeline for evaluating memory-augmented response generation in long-term conversations.

5.1 Testing Dialogue Agents' Memory via Counterfactual Questions

In TeaFarm, we proceed to "trick" dialogue agents into generating incorrect responses, and agents must correctly reference past conversations to avoid being misled by us. Specifically, we talk to the dialogue agent while acting as if a non-factual statement is true (thus counterfactual). Figure 8 presents some examples of counterfactual questions and the corresponding facts.

Facts (at this moment)                                  Counterfactual Questions
Speaker B has never been to Japan.                      A: Hey, did you have a great time in Tokyo?
Speaker A bought a new house in NYC three months ago.   B: So you are still hesitating to buy that house in NYC you've been talking about, right?
Speaker B does not own a car.                           B: Hey, do you remember when we sang karaoke in my car?

Figure 8: Examples of counterfactual questions.

In practice (Figure 11), when we want to evaluate an agent that has been interacting with the user for sessions, we first (1) collect all past conversations and summarize them session by session. Then, we (2) feed a question generator LLM[9] the [...]
(In the other two evaluations, the samples are randomly selected). The interface provided to AMT workers, which includes detailed instructions for human evaluation, is shown in Figure 14. We provide session-specific results for automatic evaluations in Table 9.

[...] sometimes make the length of input tokens too long such that the agent cannot properly utilize key information provided. We believe this can be resolved to an extent via dedicated prompt (i.e., the prompt for RG) engineering. We leave this to future work.

• COMEDY (baseline): We adopt the original prompt from Chen et al. (2024b).
[Table: ablation settings — THEANINE (ours); w/o Relation-aware Linking; w/o Refinement; Shuffled Timeline.]
Algorithm 1 Memory Graph Construction (Phase I)
Require: Memory graph Gt = (V t , E t )
Require: New memories Mnew = {mnew1 , ..., mnewN }
Require: Set of relations R = {Cause, Reason, Want, ..., SameTopic}
Ensure: Memory graph Gt+1 = (V t+1 , E t+1 )
1: Υ(mi , mj ) = ri,j if mi is assigned with ri,j ∈ R with mj , else None
2: Ω(V ) = (the most recent memory m ∈ V )
3: Et+1 ← Et
4: for mnew ∈ Mnew do
5: Ma ← {mi ∈ V t | mi has top-j similarity with mnew }
6: Ma∗ ← {mi ∈ Ma | Υ(mi , mnew ) = r for r ∈ R}
7: C ← {Ci | Ci connected component of Gt s.t. V(Ci ) ∩ Ma∗ ̸= ∅ }
8: Mlinked ← {Ω(V(Ci ) ∩ Ma∗ ) | Ci ∈ C}
9: Enew ← {⟨mi , Υ(mi , mnew ), mnew ⟩ | mi ∈ Mlinked }
10: Et+1 ← Et+1 + Enew
11: end for
12: V t+1 ← V t + Mnew
13: Gt+1 ← (V t+1 , E t+1 )
14: return Gt+1
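Algorithm 1 transcribes into Python almost line by line. In this sketch, memories are (event, time) tuples, and `upsilon` and `similar` are toy stand-ins for the LLM relation assigner Υ (line 1) and the top-j retriever (line 5); only the set and graph bookkeeping mirrors the pseudocode.

```python
def connected_components(vertices, edges):
    # Weakly connected components of the directed memory graph G^t.
    adj = {v: set() for v in vertices}
    for src, _r, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    seen, comps = set(), []
    for v in adj:
        if v in seen:
            continue
        comp, stack = set(), [v]
        seen.add(v)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(comp)
    return comps

def link_new_memories(vertices, edges, new_memories, upsilon, similar, j=3):
    """Algorithm 1: produce G^{t+1} = (V^{t+1}, E^{t+1}) from G^t."""
    new_edges = list(edges)                                      # line 3
    comps = connected_components(vertices, edges)                # components of G^t
    for m_new in new_memories:                                   # line 4
        m_a = similar(m_new, vertices, j)                        # line 5
        m_a_star = {m for m in m_a if upsilon(m, m_new)}         # line 6
        for comp in comps:                                       # line 7
            hits = m_a_star & comp
            if hits:                                             # line 8: Omega(...)
                m = max(hits, key=lambda x: x[1])                # most recent
                new_edges.append((m, upsilon(m, m_new), m_new))  # lines 9-10
    return vertices + list(new_memories), new_edges              # lines 12-13
```
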
[Figure 11: The TeaFarm pipeline: (1) conducting long-term conversations; (2) collecting summaries; (3) generating counterfactual QAs; (4) synthesizing dialogue session 6 (session 5 + counterfactual Q); (5) asking the counterfactual question; (6) measuring answer correctness with GPT-4.]
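Read as a loop over facts, the TeaFarm pipeline of Figure 11 could be scaffolded as below. Here `make_question`, `agent`, and `judge` stand in for the question-generator LLM, the evaluated dialogue agent, and the GPT-4 correctness check, so this is illustrative plumbing rather than the paper's implementation.

```python
def teafarm_pass_rate(facts, make_question, agent, judge):
    """Fraction of counterfactual probes the agent answers correctly."""
    correct = 0
    for fact in facts:
        question = make_question(fact)   # counterfactual question from the fact
        answer = agent(question)         # agent must resist the false premise
        if judge(fact, answer):          # does the answer respect the fact?
            correct += 1
    return correct / len(facts)
```
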
We are surveying the quality of the relation between sentences A and B.
Specifically, you will be given two sentences, A and B, along with a relation between them. You will be asked to determine if the
relation between the two sentences is properly linked. In other words, the evaluation criterion is the appropriateness of
the relation between the two sentences.
Relations:
1. Changed: when events in [Sentence A] changed to events in [Sentence B]
2. Causes: when events in [Sentence A] caused events in [Sentence B]
3. Reason: when events in [Sentence A] are due to events in [Sentence B]
4. HinderedBy: when events in [Sentence B] can be hindered by events in [Sentence A], and vice versa
5. React: when, as a result of events in [Sentence A], the subject feels as mentioned in [Sentence B]
6. Want: when, as a result of events in [Sentence A], the subject wants events in [Sentence B] to happen
7. SameTopic: when the specific topic addressed in [Sentence A] is also discussed in [Sentence B]
8. None: when [Sentence A] and [Sentence B] are irrelevant
Guidelines:
1. There are four choices: Definitely Disagree / Agree and Slightly Disagree / Agree
2. Please trust your instincts and choose Definitely if you would feel more confident giving one response, versus the other one.
Sentence A
${sentence_a}
Relation
${relation}
Sentence B
${sentence_b}
Q1. Do you think the relation between the two sentences is properly linked?
You will be given a sequence of two sentences connected with one relation, and a refined version of it. Your task is to judge whether
the refinement was done appropriately, such that the refined sentences can serve as a useful information source for you to make the
next response based on the dialogue context.
Relations:
1. Changed: when events in [Sentence A] changed to events in [Sentence B]
2. Causes: when events in [Sentence A] caused events in [Sentence B]
3. Reason: when events in [Sentence A] are due to events in [Sentence B]
4. HinderedBy: when events in [Sentence B] can be hindered by events in [Sentence A], and vice versa
5. React: when, as a result of events in [Sentence A], the subject feels as mentioned in [Sentence B]
6. Want: when, as a result of events in [Sentence A], the subject wants events in [Sentence B] to happen
7. SameTopic: when the specific topic addressed in [Sentence A] is also discussed in [Sentence B]
8. None: when [Sentence A] and [Sentence B] are irrelevant
Guidelines:
1. There are four choices: Definitely Disagree / Agree and Slightly Disagree / Agree
2. Please trust your instincts and choose Definitely if you would feel more confident giving one response, versus the other one.
Dialogue Context
${dialogue}
After Refinement
${after_refinement}
Q1. Do you think that the sentence after refinement is appropriately refined considering the dialogue context and its
relations?
We are surveying the quality of a response given a dialogue context.
Specifically, you will be given speaker information in chronological order, a dialogue context, and a response to the last utterance in
the dialogue context. You will be asked to judge the quality of the response to the last utterance.
Criteria:
1. Entail: when the response to the last utterance in the dialogue context appropriately reflects the given information.
2. Neutral: although the response does not reflect the speaker information, it does not contradict it either.
3. Contradictory: when the response to the last utterance in the dialogue context contains a statement that contradicts the "most
up-to-date information about that statement."
Dialogue Context
${dialogue}
Response
${response}
Q1. Based on the criteria, select the option that fits the response.
Figure 14: Interface for human evaluation regarding referencing past conversations in responses.
Figure 15: Interface for human evaluation regarding the helpfulness of retrieved memories.
Example 1 - [Changed]
[Before Linking]
Memory 1: Classmates A was initially hesitant about following Classmates B's advice.
Memory 1’s Contextual Background:
Classmates A: Thank you for the advice, but I'm not sure if I should follow it.
Memory 2: Classmates A was initially hesitant but received positive responses after starting the blog.
Memory 2’s Contextual Background:
Classmates A: Yeah, it was scary at first, but the response has been really positive.
[After Linking]
Classmates A was initially hesitant about following Classmates B's advice - [Changed] - Classmates A was initially
hesitant but received positive responses after starting the blog
Example 2 - [Cause]
[Before Linking]
Memory 1: The Child feels it is unfair that they have to do certain chores because the Parent is too tired.
Memory 1’s Contextual Background:
Child: But Mom, it's not fair that we have to wash the dishes because you're too lazy to do it.
Memory 2: The Parent acknowledges being lazy about washing dishes and promises to contribute more to keeping
the home clean.
Memory 2’s Contextual Background:
Parent: I realized how lazy I've been lately, especially when it comes to washing the dishes.
Parent: From now on, I promise to do my fair share and contribute more to keeping our home clean and organized.
[After Linking]
The Child feels it is unfair that they have to do certain chores because the Parent is too tired - [Cause] - The Parent
acknowledges being lazy about washing dishes and promises to contribute more to keeping the home clean
Example 3 - [Reason]
[Before Linking]
Memory 1: Speaker A has multiple sons, at least one of them is in a relationship with a Spanish girlfriend.
Memory 1’s Contextual Background:
Speaker A: One of my sons just told me that he has a Spanish girlfriend now.
Speaker A: . . . I'm visiting my son that lives in Spain next month. This will give me a chance to finally meet his
girlfriend of three years now!
Memory 2: Speaker A is interested in learning Spanish and Portuguese before her trip.
Memory 2’s Contextual Background:
Speaker A: Sounds great! I'm already very excited about my trip to Spain, and now I get to visit you in Lisbon! I need
to brush up on my Spanish and also start studying Portuguese.
[After Linking]
Speaker A has multiple sons, at least one of them is in a relationship with a Spanish girlfriend - [Reason] - Speaker A is
interested in learning Spanish and Portuguese before her trip
Example 4 - [HinderedBy]
[Before Linking]
Memory 1: Speaker B is currently re-reading 'Redwall' by Brian Jacques, which was a favorite book growing up.
Memory 1’s Contextual Background:
Speaker B: I'm recently re-reading Redwall by Brian Jacques! It was one of my favorites growing up. Have you ever
read it?
Memory 2: Speaker B has been busy with a new painting and has not had time to read.
Memory 2’s Contextual Background:
Speaker B: I think I would but I have been too busy with a new painting to get in some reading.
[After Linking]
Speaker B is currently re-reading 'Redwall' by Brian Jacques, which was a favorite book growing up - [HinderedBy] -
Speaker B has been busy with a new painting and has not had time to read
Example 5 - [React]
[Before Linking]
Memory 1: The Mentee hopes to inspire others to join the cause of gender equality and fighting discrimination.
Memory 1’s Contextual Background:
Mentee: I agree. We need more people advocating for gender equality and fighting against discrimination.
Memory 2: The Mentor acknowledges the Mentee’s work in advocacy for women and girls and praises their
dedication to their values.
Memory 2’s Contextual Background:
Mentor: . . . I think this is a great reflection of the work that you've done in advocating for women and girls.
Mentor: Absolutely. And I have no doubt that your dedication to these principles will serve you well in this new job.
[After Linking]
The Mentee hopes to inspire others to join the cause of gender equality and fighting discrimination - [React] - The
Mentor acknowledges the Mentee’s work in advocacy for women and girls and praises their dedication to their values
Example 6 - [Want]
[Before Linking]
Memory 1: Neighbors A and B don't know each other well and want to spend more time together.
Memory 1’s Contextual Background:
Neighbors A: . . . I feel like I don't know you well enough.
Neighbors A: Well, maybe we could hang out once a week or something.
Memory 2: Neighbor A enjoys spending time in Neighbor B's cozy home and wants to hang out more often.
Memory 2’s Contextual Background:
Neighbors A: It's okay, I love spending time in your cozy home. And speaking of spending time, can we hang out more
often?
[After Linking]
Neighbors A and B don't know each other well and want to spend more time together - [Want] - Neighbor A enjoys
spending time in Neighbor B's cozy home and wants to hang out more often
Example 7 - [SameTopic]
[Before Linking]
Memory 1: Speaker A enjoys reading sci-fi and mysteries, while Speaker B prefers fantasy books.
Memory 1’s Contextual Background:
Speaker A: I prefer sci-fi but here recently I have been craving a god mystery.
Speaker B: . . . I mostly read fantasy books myself.
Memory 2: Speaker B enjoys reading the Odd Thomas and Dark Tower series and finds inspiration for their artwork
during nature walks.
Memory 2’s Contextual Background:
Speaker B: I felt that way about the Odd Thomas series. Could never wait for the next one to come out.
Speaker B: I think I may start re-reading the entire Dark Tower series. And continue to work for new works that
interest me.
[After Linking]
Speaker A enjoys reading sci-fi and mysteries, while Speaker B prefers fantasy books - [SameTopic] - Speaker B enjoys
reading the Odd Thomas and Dark Tower series and finds inspiration for their artwork during nature walks.
Example 1
[Retrieved Raw Timelines]
Memory 1: Speaker B is in love with their neighbor, John, and shared it as a secret.
[React]
Memory 2: Speaker A knows about a person named John and suggests Speaker B talk to him about their feelings.
[Want]
Memory 3: Speaker A finds the situation exciting and wishes for more excitement in their life.
Example 2
[Retrieved Raw Timelines]
Memory 1: The coach provides information about the benefits of bean sprouts and the importance of a balanced diet
for athletes.
[SameTopic]
Memory 2: The Athlete has incorporated bean sprouts into their diet to improve health, leading to increased energy
and faster recovery.
Example 3
[Retrieved Raw Timelines]
Memory 1: Speaker A is a lifeguard and plans to propose to his girlfriend on the beach.
[Changed]
Memory 2: Speaker A wants to propose at the movie theater where they first met by hiding the ring in a bucket of
popcorn.
[SameTopic]
Memory 3: Speaker A is planning to propose to their girlfriend with a custom-made solitaire ring on a yellow band with
little diamonds.
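The retrieved raw timelines above (Examples 1-3) share a simple structure: chronologically ordered memories connected by relation labels. As an illustrative sketch (not the authors' released code; the `Timeline` class name and `render` method are hypothetical), such a timeline can be represented and flattened into the `[Event A] - (relation) - [Event B]` string format used later in the refinement prompt:

```python
# Illustrative data structure for a retrieved raw timeline.
# Names here are hypothetical, not from the paper's implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Timeline:
    memories: List[str]                                  # chronologically ordered memories
    relations: List[str] = field(default_factory=list)   # len(memories) - 1 edge labels

    def render(self) -> str:
        """Flatten into the '[Event A] - (relation) - [Event B]' string format."""
        parts = [f"[{self.memories[0]}]"]
        for rel, mem in zip(self.relations, self.memories[1:]):
            parts.append(f"- ({rel}) - [{mem}]")
        return " ".join(parts)

# Example 3 above, encoded as a Timeline:
proposal = Timeline(
    memories=[
        "Speaker A is a lifeguard and plans to propose to his girlfriend on the beach.",
        "Speaker A wants to propose at the movie theater where they first met ...",
        "Speaker A is planning to propose with a custom-made solitaire ring ...",
    ],
    relations=["Changed", "SameTopic"],
)
```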
Current Conversation:
A: How have you been, B? And how's the grandkids?
B: We're all fine, thank you very much. The grandkids are growing up so fast! They're already into new things. How
about your family?
A: They're doing well, thank you. My kids have started taking guitar lessons. I think they're quite enjoying it.
…{omitted for clarity}…
B: I've been spending a lot of time with the grandkids and doing some gardening.
A: Well, it sounds like you've been keeping busy. I hope you can visit the library again soon. We could use a
volunteer like you.
B: You're still working as a librarian, aren't you? (Counterfactual question)
Target Fact:
A is a retired librarian, who now works as a volunteer, introducing creative activities at the library.
Retrieved Timelines (we remove refinement and present the raw memories for clarity):
Color blue = memories originally retrieved by the retriever.
● [Speaker A is busy with work and unable to spend as much time with their kids as they would like],
[Speaker B enjoys spending time with their grandkids].
● [Speaker B's mother was a librarian, which seems to have shaped their view of the profession], [Speaker
A is a retired librarian], [Speaker A introduces creative activities at the library].
Figure 20: THEANINE fails to pass TeaFarm (Example 1), due to a sudden topic change.
Current Conversation:
A: I've been working on improving my stamina lately.
B: Why is that?
A: I've realized that I need to be physically fit to keep up with my studies and other activities. It's not just about the
mind, but the body as well.
B: That's really insightful of you. Any noticeable improvements yet?
A: Absolutely. I feel like I can focus better on my studies and I'm less tired during the day.
…{omitted for clarity}…
B: It's nice to see you've found a balance between hard work and relaxation.
A: Hey, remember when I once lost a race to my brother?
Target Fact:
A has beaten their brother in races due to their competitive nature.
Retrieved Timelines (we remove refinement and present the raw memories for clarity):
Color blue = memories originally retrieved by the retriever.
● [A is always looking to challenge themselves and improve their skills], [A is determined and hard-working,
as seen in their desire to win a race and their dedication to studying].
● [A is competitive and recently beat their brother in a race.], [B brings up a past event where A beat their
brother in a race, highlighting their competitive nature.], [A reminisces about a past race with their brother
and a conversation with Tyler about a personal issue.]
● …
Figure 21: THEANINE fails to pass TeaFarm (Example 2), due to sub-optimal timeline utilization during RG.
Prompt for Relation-aware Memory Linking
Your task is to find the relation between [Sentence A] and [Sentence B].
Keep in mind that [Sentence A] happened before [Sentence B].
The dialogues from which each sentence originates are provided to help your reasoning.
First, identify if the relation holds among the following six relations:
1. Changed: when events in [Sentence A] changed to events in [Sentence B]
2. Cause: when events in [Sentence A] caused events in [Sentence B]
3. Reason: when events in [Sentence A] are due to events in [Sentence B]
4. HinderedBy: when events in [Sentence B] can be hindered by events in [Sentence A], and vice
versa
5. React: when, as a result of events in [Sentence A], the subject feels as mentioned in [Sentence B]
6. Want: when, as a result of events in [Sentence A], the subject wants events in [Sentence B] to
happen.
Then, if the relation does not belong to any of the relations from 1 to 6, choose between the
following two options:
7. SameTopic: when the specific topic addressed in [Sentence A] is also discussed in [Sentence B]
8. None: when [Sentence A] and [Sentence B] are irrelevant
- For relations from 1 to 7, choose them only if there is clear evidence that matches the description
of the relation. Otherwise, just choose "None" without making excessive inferences beyond the
given sentence.
- Pay attention to who the subject of each sentence is.
- Do not confuse the roles of [Sentence A] and [Sentence B] when determining the relationship.
Now, read the two dialogues and find the relation between [Sentence A] and [Sentence B].
<INPUT>
[Dialogue for Sentence A]:
{dialogue1}
[Dialogue for Sentence B]:
{dialogue2}
<OUTPUT>
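A minimal sketch of how the output of the linking prompt above might be parsed. This is an assumption about the surrounding driver code, not the authors' implementation: the parser simply constrains the LLM output to the eight labels the prompt defines, falling back to "None".

```python
# Hypothetical parser for the relation-aware linking prompt's output.
# The label inventory is exactly the eight relations defined in the prompt.
import re

RELATIONS = ["Changed", "Cause", "Reason", "HinderedBy",
             "React", "Want", "SameTopic", "None"]

def parse_relation(llm_output: str) -> str:
    """Return the first allowed relation label found in the LLM output,
    falling back to 'None' when no label can be recognized."""
    for rel in RELATIONS:
        if re.search(rf"\b{rel}\b", llm_output):
            return rel
    return "None"
```

Constraining free-form LLM output to a closed label set like this keeps downstream graph construction robust to extra explanation text in the model's response.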
Prompt for Context-aware Timeline Refinement
Given Timelines, which are structured in this format: [Event A] - (relation) - [Event B] ...,
your job is to naturally transform each timeline into useful information that can help an
AI assistant respond to the Current Dialogue.
1. Changed: when events in [Event A] changed to events in [Event B]
2. Cause: when events in [Event A] caused events in [Event B]
3. Reason: when events in [Event A] are due to events in [Event B]
4. HinderedBy: when events in [Event B] can be hindered by events in [Event A], and vice
versa
5. React: when, as a result of events in [Event A], the subject feels as mentioned in [Event B]
6. Want: when, as a result of events in [Event A], the subject wants events in [Event B] to
happen
7. SameTopic: when the specific topic addressed in [Event A] is also discussed in [Event B]
If a given relation is not proper, naturally connect them without using that relation.
Current Dialogue:
{current_dialogue_context}
Timelines:
{input_path}
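Assembling the refinement prompt above is a straightforward template fill: the current dialogue context and the flattened timeline strings are substituted into the two placeholders. The sketch below abbreviates the template text and names the helper hypothetically; the actual LLM call is out of scope.

```python
# Sketch of filling the context-aware timeline refinement prompt.
# Template text is abbreviated; `build_refinement_prompt` is a hypothetical name.
REFINEMENT_TEMPLATE = (
    "Given Timelines, which are structured in this format: "
    "[Event A] - (relation) - [Event B] ..., your job is to naturally "
    "transform each timeline into useful information ...\n"
    "Current Dialogue:\n{current_dialogue_context}\n"
    "Timelines:\n{input_path}\n"
)

def build_refinement_prompt(dialogue: str, timelines: list) -> str:
    # One flattened timeline per line, in '[Event] - (relation) - [Event]' form.
    return REFINEMENT_TEMPLATE.format(
        current_dialogue_context=dialogue,
        input_path="\n".join(timelines),
    )
```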
Prompt for Memory Update (Baseline)
Compare the 'memory' and 'summary' of the two given sentences according to the following
instructions, and output which of the following relations the two sentences have.
-'PASS': When the information in 'memory' already contains the information in 'summary', that is,
it is duplicated in content.
-'CHANGE': When the information in 'memory' has been changed in 'summary'.
-'REPLACE': When 'summary' has more information than the 'memory' without missing any
details in 'memory'.
-'APPEND': When 'summary' has new information or different information compared to
'memory'.
-'DELETE': When the situation in 'memory' has been completed or solved in 'summary'.
Tips: Most of the relations are likely to be 'APPEND'. When choosing other relations, explain with
clear evidence.
Now write the relations and explanation between the following memory and summary.
memory: {memory}
summary: {summary}
Figure 25: The prompt for the memory updating mechanism in baselines (i.e., + Memory Update).
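The five labels produced by the baseline's update prompt map onto simple edit operations over a memory store. The sketch below is illustrative only (function name and list-based store are assumptions, not the baselines' implementation, which would track IDs, timestamps, and embeddings):

```python
def apply_update(memories: list, memory: str, summary: str, relation: str) -> list:
    """Apply one of the five update operations from the baseline prompt above
    to a list-based memory store. Illustrative sketch only."""
    memories = list(memories)
    if relation == "PASS":                  # summary already covered by memory
        return memories
    if relation in ("CHANGE", "REPLACE"):   # summary supersedes the old memory
        return [summary if m == memory else m for m in memories]
    if relation == "APPEND":                # summary adds new information
        return memories + [summary]
    if relation == "DELETE":                # the situation in memory is resolved
        return [m for m in memories if m != memory]
    raise ValueError(f"unknown relation: {relation}")
```

Note that under this scheme only APPEND preserves the old memory alongside the new one, which is why the prompt's tip nudges the model toward APPEND absent clear evidence.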
Prompt for G-Eval: Helpfulness of Retrieved Memories
Your task is to choose a more helpful MEMORY based on the below criterion.
CRITERION:
Helpfulness - A more helpful MEMORY should contain speaker information that is related to
CURRENT DIALOGUE CONTEXT, enabling the {speaker} to respond in an appropriate
context to the last utterance of the CURRENT DIALOGUE CONTEXT.
The output format should be as follows:
Explanation: (a brief explanation)
Choice: (answer with "1", "2", or "tie")
Now choose the MEMORY that has better Helpfulness.
CURRENT DIALOGUE CONTEXT:
{current_dialogue_context}
MEMORY 1:
{memory1}
MEMORY 2:
{memory2}
YOUR OUTPUT:
Figure 26: The prompt for the G-Eval: Helpfulness of Retrieved Memories.
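The G-Eval prompt above fixes a two-line output format, so the pairwise verdict can be extracted with a small parser. This is a hypothetical helper, assuming the judge model follows the requested format:

```python
def parse_geval_choice(output: str) -> str:
    """Extract the verdict from the 'Choice:' line of the G-Eval output
    format above. Returns '1', '2', or 'tie'; raises on malformed output."""
    for line in output.splitlines():
        if line.strip().lower().startswith("choice:"):
            choice = line.split(":", 1)[1].strip().strip('"').lower()
            if choice in {"1", "2", "tie"}:
                return choice
    raise ValueError("no valid Choice line found")
```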
Prompt for Generating counterfactual QA in TeaFarm
The summaries below are summarized from conversations between two speakers throughout
multiple encounters and are listed in chronological order.
First, read these summaries and capture the development of facts about the speakers.
Then, pretend that you are one of the speakers and want to test whether a chatbot trained to
represent the other speaker can correctly remember past conversations.
You do so by asking counterfactual questions, i.e., tricky questions made with non-factual
statements.
Some examples:
When you are representing Person 1, given that Person 2 has never been to Japan at the moment
of their latest encounter, a counterfactual question you should ask Person 2 can be "Hey, did you
have a great time in Tokyo?".
When you are representing Person 2, given that Person 1 once mentioned that they bought a new
house in NYC three months ago, a counterfactual question you should ask Person 1 can be "So you
are still hesitating to buy that house in NYC you've been talking about. Right?".
Now, generate two counterfactual questions, one from the perspective of {speaker1} and one from
{speaker2}, based on the summaries, and also generate correct answers with which a chatbot that
perfectly remembers past conversations should answer.
Also, please insert the speaker tags ("{speaker1}:" and "{speaker2}:") and avoid including them in the
questions/answers themselves.
[Question 1]
{speaker1}:
[Last utterance]
{Question}
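End to end, a TeaFarm check poses the generated counterfactual question as the last utterance, collects the agent's response, and lets a judge LLM compare it against the gold (factual) answer. The sketch below assumes two hypothetical callables, `agent_respond` (the dialogue agent under test) and `judge` (an LLM filling the evaluation prompt), neither of which is part of the paper's released code:

```python
def teafarm_pass(question: str, gold_answer: str, agent_respond, judge) -> bool:
    """Ask the agent one counterfactual question and let a judge LLM decide
    whether the response is consistent with the gold (factual) answer.
    `agent_respond` and `judge` are hypothetical callables."""
    response = agent_respond(question)
    verdict = judge(query=question, answer=gold_answer, response=response)
    # Assumed convention: the judge's verdict begins with "Correct"/"Incorrect".
    return verdict.strip().lower().startswith("correct")
```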
Prompt for Evaluating model responses in TeaFarm
[Question]
{query}
[Answer]
{answer}
[Chatbot's Answer]
{response}
-Your Task-
[Evaluation]