Larimar: Large Language Models with Episodic Memory Control

Payel Das*1, Subhajit Chaudhury*1, Elliot Nelson1, Igor Melnyk1, Sarathkrishna Swaminathan1, Sihui Dai1,2, Aurélie Lozano1, Georgios Kollias1, Vijil Chenthamarakshan1, Jiří Navrátil1, Soham Dan1, Pin-Yu Chen1

*Equal contribution. 1IBM AI Research, 2Princeton University. Correspondence to: Payel Das <daspa@[Link]>.

arXiv:2403.11901v1, 18 Mar 2024

Abstract

Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar, a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to the most competitive baselines, even in the challenging sequential editing setup, but also excels in speed, yielding speed-ups of 4-10x depending on the base LLM, as well as flexibility due to the proposed architecture being simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting and input context length generalization with Larimar and show their effectiveness.

1. Introduction

Pre-trained Large Language Models (LLMs) have achieved impressive performance on various Natural Language Processing (NLP) tasks (Devlin et al., 2018; Raffel et al., 2020; Brown et al., 2020; Vaswani et al., 2017), and are often considered as knowledge repositories (Petroni et al., 2019). In order to keep these models fact-relevant, safe, and ethical after deployment, the knowledge of the LLM needs to be constantly updated. Thus, it is critical to develop efficient mechanisms to quickly update LLMs so that models can protect privacy, eliminate bias and hallucination, and catch up with new facts. Model editing should remove undesired, incorrect, or obsolete facts from the LLM's "memory", and optionally replace them with the desired outcome. Similarly, the ability to quickly update the LLM can also help with the challenging problem of input context length generalization beyond the training distribution, which is crucial when learning from datasets where longer context instances are rare (Anil et al., 2022; Kazemnejad et al., 2023). A straightforward solution is to fine-tune the model on the corrected/new datasets. Such an approach suffers the risk of overfitting and catastrophic forgetting (Kirkpatrick et al., 2017; Zhu et al., 2020), as the knowledge is implicitly and distributionally encoded across the LLM parameters. Several lines of research have proposed effective and precise LLM editing (for comprehensive surveys on LLM editing, see Li et al., 2022; Liu et al., 2023; Zhu et al., 2020), which includes training an external memory model or a hypernetwork model to work alongside the frozen LLM. Another popular approach is to locate the original fact within the LLM features and then do a local parameter update. As shown in Table 1, both lines of methods face scalability problems due to overfitting and the need for retraining or locating for new states, causing a slow-down in editing speed. The high memory needed for storing numerous edits is a further obstacle to scaling to sequential and batch editing setups. These challenges hinder the application of updating large language models in real-world industrial settings. Further, handling fact editing and selective fact forgetting within the same methodological framework appears challenging even for current state-of-the-art editing methods (Patil et al., 2023), while new information learning and old information forgetting are intrinsically related to each other in the brain (Dempsey et al., 2022; Autore et al., 2023).

Humans, in contrast, can very quickly perform knowledge updating and generalization, both of which conform to rapid learning after seeing the first relevant instance. In the brain, such rapid learning is thought to depend on the hippocampus and its capacity for episodic memory. Consistently, while both semantic and working memory systems struggle with sequential decision making tasks, episodic memory systems are found to be beneficial (Blundell et al., 2016; Lengyel and Dayan, 2007). The complementary learning systems (CLS) theory (Kumaran et al., 2016) provides rationale for coupling complementary fast (hippocampus) and slow (neocortex) learning systems in the brain, with the former learning from single instances while the latter models the input distribution.
The neocortex-hippocampus interactions in the brain are known to promote adaptive behavior via memorization and generalization (Sun et al., 2023). Further, it has been proposed that memory consolidation from hippocampus to neocortex is facilitated through activation synchronized with multiple exact or false replays of the encoded experience in the hippocampus, suggesting that the hippocampus takes the form of a generative associative network (Ramirez et al., 2013).

Inspired by these insights, we propose Larimar, a class of LLMs augmented with an external episodic memory controller. We follow the CLS view, where a hippocampal fast-learning system records samples as episodic memory, and a neocortical slow-learning system (the LLM) learns summary statistics of the input distribution as semantic memory. Our aim is to treat the episodic memory module as the global storage of the current set of factual updates or edits, and enforce this memory as a condition on the LLM decoder. It is important to learn to update this memory efficiently and accurately, without having to go through any training, as new edits arrive.

To tackle this, we seek to utilize a hierarchical memory, similar in spirit to the Kanerva Machine (Wu et al., 2018a), where memory writes and reads are interpreted as inference in a generative model. Specifically, we consider the memory model of (Pham et al., 2021), which treats the memory as deterministic, thereby allowing the Bayesian updates of memory and address proposed in the Kanerva Machine to be reformulated as finding least-squares solutions to linear systems. Once updated, this fast-learning memory is then used to condition a slow-learning LLM decoder.

The use of a global memory associated with a set of samples and the ability to fast-write to memory make this hierarchical memory framework attractive for efficient LLM updating with respect to new knowledge. Implementation-wise, the memory is coupled to the LLM by end-to-end gradient descent on generic data and does not assume access to edits. During inference, the new data is written to memory in one shot; the updated memory then conditions the LLM decoding to enforce the edited output. We further formalize training-free selective fact forgetting and information leakage prevention operations based on Larimar's one-shot memory updating mechanism.

To our knowledge, this is the first work that proposes and demonstrates online distributed writing to a hierarchical conditional memory model as a solution to test-time adaptation of LLMs to new knowledge. We demonstrate Larimar on single and sequential fact editing tasks on existing benchmarks and compare it with baseline methods. Larimar provides accurate and precise editing across these settings, while being up to 10 times faster compared to competitive model editing baselines. We further subject Larimar to selective fact forgetting and information leakage prevention and show its efficacy in those tasks. Lastly, we provide a simple recursive search-based solution that enables Larimar's memory to generalize to longer input contexts.

Our contributions are:

• Inspired by complementary learning mechanisms in the brain, we propose a class of episodic and adaptable memory-conditioned LLM architectures for test-time adaptation in real time. Our method does not need any time-intensive gradient-based learning or fact tracing within the LLM for performing the edit, providing a faster alternative for LLM updating.

• We demonstrate the utility of this architecture on two relevant and challenging use cases: knowledge editing and input context length generalization. Larimar shows fast and accurate training-free adaptation to new inputs in both scenarios, compared to baseline editing methods and language models.

• We show selective fact forgetting and information leakage prevention using one-shot memory updating.

• We provide a simple means to enable long context generalization in Larimar, based on a recursive search on its memory space.

2. Model architecture

Notation: We define input and output spaces as X and Y, respectively. The model comprises an encoder e : X → R^C and a decoder d : R^C → Y, linked via an adaptive memory. The encoder outputs in a latent space of dimension C. The memory uses K rows to store encoded episodes of length N, with initial state M_0 ∈ R^{K×C} and updates through reading and writing weights W, W_0 ∈ R^{N×K}, resulting in an updated memory M.

2.1. Training

Given the memory M, the Kanerva Machine aims to maximize the conditional log-likelihood ln p(X|M), where X is an exchangeable (order-invariant) episode: X = {x_1, ..., x_N}, a subset of the input data consisting of N samples. A variational lower bound of this conditional likelihood is optimized, similar to variational autoencoders (Kingma and Welling, 2013). Consequently, the model learns to compress X in a memory M, which then becomes a distributed associative memory. In practice, M is learned on a noisy version of the latent encodings Z + ξ, where Z = e(X) for an episode. In the remainder of this study, we use M as the posterior memory dependent on an episode X, whereas M_0 denotes a prior memory.
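The notation can be made concrete with a short sketch. The dimensions below are placeholders chosen for illustration (K = 512 and C = 768 match the memory matrix size reported in the implementation details; N is arbitrary), and the code is not the released implementation.

```python
import numpy as np

# Illustrative dimensions: latent dim C, memory rows K, episode length N.
C, K, N = 768, 512, 6

rng = np.random.default_rng(0)
M0 = rng.normal(size=(K, C))     # prior memory M_0 (a learned parameter in Larimar)
Z = rng.normal(size=(N, C))      # episode encodings, Z = e(X)

# Mean addressing weights W (N x K) and memory readout Z_read = W M (N x C).
W = Z @ np.linalg.pinv(M0)
Z_read = W @ M0
print(W.shape, Z_read.shape)     # (6, 512) (6, 768)
```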
[Figure 1: architecture diagram. Encoder blocks map a data episode {x_1, ..., x_N} to latent vectors, which are written to a K×C memory matrix; a read operation conditions the decoder blocks on the memory readout.]

Figure 1. Larimar Architecture: X and X_query respectively denote data input and query, Z, Z_query and Z_r are the latent vectors, and M is the fixed-size memory. W and W_0 are reading/writing weights to memory. W_M interfaces the readout from memory to the decoder.

Editor | +Edit Train | +Fact Trace | Sequential Edit | Batch Edit | Forgetting/Deletion | Time (GPT-2) | Time (GPT-J)
ROME | No | Yes | No | No | Yes | 4.8s | 13.9s
GRACE | Yes | No | Yes | No | No | 13.9s | 19.3s
Larimar | No | No | Yes | Yes | Yes | 1.1s | 1.7s

Table 1. Comparison between different editing methods from the requirement and capability perspective. ROME and GRACE are shown as representatives of existing editing methods. The wall clock time for a single edit (averaged over 10 edits) on the CounterFact dataset for ROME (Meng et al., 2022a) and GRACE (Hartvigsen et al., 2022) was computed using the EasyEdit (Wang et al., 2023a) framework with a single A100 (80G) GPU.

The reading weight matrix W is a random variable (to enforce the generative ability of the model), for which we use a standard Gaussian prior p(W) ∼ N(0, I_{N×K}) and posterior q(W) ∼ N(W_bar, σ_W^2 · I_{N×K}), where the mean W_bar is estimated from each episode and σ_W is learnable. The memory readouts are obtained as Z_readout = WM. The overall memory-augmented architecture is depicted in Figure 1.

During training, all three modules (encoder e, associative memory M, and decoder d) are jointly trained and optimized for an episode X, using the following loss:

L = E_{X∼data} E_{q(W)} ln p(X | W, M) + α ln p(d(e(X))) − β D_KL(q(W) || p(W)) − E_{X∼pretrain} Σ_i ln p(x_i | x_{i−1}, ..., x_1).   (1)

The first term is the negative reconstruction loss with memory and W, an N × K matrix. The second is the autoencoder's negative reconstruction loss without memory. The third is the KL divergence between the prior p(W) and the posterior q(W). To maintain decoder performance during training, a pretraining data regularization term is added.

2.2. Memory inference

Once M_0 is trained via backpropagation, the posterior memory M is updated in one shot by solving a minimization problem as proposed in (Pham et al., 2021), namely min_M ||Z_ξ − W_0 M||_F^2. This minimization problem, which corresponds to solving a linear system of equations, is efficiently done via computing matrix pseudo-inverses.

Implementation: We employed a BERT-large encoder (Devlin et al., 2018) combined with either a GPT-2 large (Radford et al., 2019) or a GPT-J 6B decoder and a memory matrix (512×768) for our training experiments, naming the resulting models Larimar-1.3B and Larimar-6B, respectively. Our training data comprised 7.6 million examples constructed by splitting WikiText (Merity et al., 2016) texts into small chunks of 64 tokens. In testing, the Larimar-1.3B model achieved a perplexity of 14.6, while the Larimar-6B model reached 15.9 on 1,000 random WikiText samples, indicating that adding memory barely affects performance. We trained Larimar-6B models for 10 epochs using the Adam optimizer, learning rate 5e-6, and batch size 32. For Larimar-6B's training, we used a setup with eight NVIDIA A100-80GB GPUs on a single node, utilizing bfloat16 precision and PyTorch Lightning with DeepSpeed ZeRO Stage 2 for efficient distributed training.
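To make the one-shot update concrete, the following is a minimal sketch (not the authors' code) of the write step: given noisy episode encodings Z_ξ and addressing weights W_0 computed against the prior memory, the posterior memory is the least-squares solution of min_M ||Z_ξ − W_0 M||_F^2, obtained via a pseudo-inverse. The noise level and dimensions are placeholder values.

```python
import numpy as np

def one_shot_write(Z, M0, sigma_xi=0.1, rng=np.random.default_rng(0)):
    """One-shot memory update via least squares (illustrative sketch).

    Z  : (N, C) episode encodings, Z = e(X)
    M0 : (K, C) prior memory (a learned parameter)
    Returns the posterior memory M of shape (K, C).
    """
    Z_xi = Z + sigma_xi * rng.normal(size=Z.shape)   # noisy encodings Z + xi
    W0 = Z_xi @ np.linalg.pinv(M0)                   # addressing weights, (N, K)
    M = np.linalg.pinv(W0) @ Z_xi                    # argmin_M ||Z_xi - W0 M||_F^2
    return M

C, K, N = 768, 512, 4
rng = np.random.default_rng(1)
M = one_shot_write(rng.normal(size=(N, C)), rng.normal(size=(K, C)))
print(M.shape)   # (512, 768)
```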
3. Memory operations

Write, Read, Generate operations. The three basic memory operations, write in, read out, and generate, which act upon the Z encodings, are cast as in (Pham et al., 2021). See Algorithm 1 for details.

Algorithm 1: Basic memory operations (Pham et al., 2021)

Function write(Z):
  // Z: encoding of the episode to be written to memory (i.e., Z = e(X))
  Sample ξ ∼ N(0, σ_ξ^2 I); let Z_ξ = Z + ξ
  Compute addressing weight W_0 = Z_ξ M_0^†   // M_0 is a learned parameter representing the prior memory
  Compute posterior memory M = W_0^† Z_ξ
  return M

Function read(Z, M):
  // M: posterior memory from a previous write
  // Z: encoding of the read input (i.e., Z = e(X))
  Compute mean addressing weight W_bar = Z M^†
  Sample W ∼ N(W_bar, σ_W^2 I)   // σ_W is a learned parameter
  Compute output latent Z_read = W M
  return Z_read

Function generate(M):
  // M: posterior memory from a previous write
  Sample W ∼ N(0, I)
  Compute output latent Z = W M
  return Z

Sequential Writing and Forgetting. Given an initial set of encodings Z_0 and writing weights W_0, we initialize the memory matrix and key covariance matrix:

M_0 = W_0^† Z_0,   C_0 = W_0^⊤ W_0   (2)

To sequentially update the memory M_{i−1}, either to add a new set of encodings Z_i or to forget a previously written set of encodings Z_i, we jointly update the memory matrix and key covariance matrix for i = 1, 2, ... as follows:

C_i = C_{i−1} + α_i W_i^⊤ W_i   (3)

M_i = M_{i−1} + α_i C_i^{−1} W_i^⊤ (Z_i − W_i M_{i−1})   (4)

When writing new encodings to memory, we use α_i = 1. When forgetting encodings which were previously written to memory with α_{i_write} = 1 at some i_write < i, we use α_i = −1. Eq. (4) updates the memory sequentially such that it remains the least-squares solution for the growing sequence of data. Assuming that M_{i−1} is the least-squares solution with respect to encodings Z_{0:i−1}, that is,

M_{i−1} = argmin_M Σ_{j=0}^{i−1} ||Z_j − W_j M||_2^2,   (5)

then Eq. (4) with α_i = 1 ensures that M_i is likewise the least-squares solution with respect to Z_{0:i} (Meng et al., 2023). In the case α_i = −1 and Z_i = Z_{i_forget} for some i_forget < i, Eq. (4) ensures that M_i is the least-squares solution with Z_{i_forget} removed from the data, that is,

M_i = argmin_M Σ_{j=0, j≠i_forget}^{i−1} ||Z_j − W_j M||_2^2.   (6)

The weights can be computed either (following (Pham et al., 2021)) in terms of the current memory, W_i = Z_i M_{i−1}^†, or in terms of a fixed reference memory, W_i = Z_i (M^(ref))^†. M^(ref) remains unchanged across all sequential updates (i.e., it is i-independent), is used only during inference, and can (optionally) be constructed using the episode of data encountered during inference. In the event that we wish to remove a given previously written encoding from memory, the fixed nature of M^(ref) allows the original writing key W_{i_write} to be recomputed at a later point in the sequence i_forget > i_write, so that the information can be located in memory and removed.

4. Scope Detector

We also optionally use a scope detection mechanism to detect if the incoming query is close to the facts written in the memory, which is conceptually similar to SERAC (Mitchell et al., 2022). If the query is in-scope, then the corresponding readout from memory is passed to the decoder for memory-conditional decoding; otherwise the query is subjected to unconditioned decoding. We consider two different scenarios:

External encoding-based scope detector (ESD): Sample embeddings are estimated from an external sentence encoder (MiniLM^1) trained on 1.1B sentence pairs and with an output space dimensionality of 384. The ESD stores encoded facts as vectors in its own scope storage. At test time, given an encoded input sentence, 1-nearest-neighbor cosine similarity is calculated and serves as the detection score. Any multi-sentence input is first split into isolated sentences, each of which is processed separately, and the maximum similarity is taken. Measured on 3800 positive and negative samples from the EasyEdit data set, this ESD model achieves a detection equal-error-rate of 2.9% and an F1 score of 0.974.

^1 [Link] L6-v2
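A minimal sketch of the ESD logic described above is given below, assuming a generic sentence-embedding callable `embed`; the callable, the threshold value, and the simple period-based sentence splitting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class ScopeDetector:
    """1-nearest-neighbor cosine similarity against stored fact embeddings."""

    def __init__(self, embed, threshold=0.7):
        self.embed = embed            # callable: list[str] -> (n, d) array
        self.threshold = threshold    # tuned on labeled in/out-of-scope samples
        self.scope = np.zeros((0, 0))

    def add_facts(self, facts):
        vecs = self.embed(facts)
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        self.scope = vecs if self.scope.size == 0 else np.vstack([self.scope, vecs])

    def score(self, text):
        # Split multi-sentence inputs; keep the maximum per-sentence similarity.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        vecs = self.embed(sentences)
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        return float((vecs @ self.scope.T).max())

    def in_scope(self, text):
        return self.score(text) >= self.threshold
```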
Internal encoding-based scope detector (ISD): Larimar's encoder e is used to embed CounterFact samples. The encodings are then used to train a binary scope classifier, where positive samples come from rephrasings of an original fact and negative data correspond to neighboring facts.

5. Results

5.1. Wall Clock Time

Table 1 presents the wall clock time for each editing method across 10 edits, calculated within the EasyEdit framework (Yao et al., 2023) on a single A100 GPU. Results show that Larimar is 4-10x faster compared to ROME (Meng et al., 2022a) and GRACE (Hartvigsen et al., 2022), the two most competitive existing LLM editing baselines. Table 7 in the Appendix further provides an edit-time comparison with other existing baselines, as reported in (Yao et al., 2023), establishing Larimar's advantage in high-speed editing. Table 1 further lists Larimar's ability to handle edits in a training- or tracing-free manner, enabling high-speed editing, to handle selective forgetting, and to maintain performance in the sequential editing setup.

5.2. Single Fact Editing

We compare the performance of Larimar against a number of recently proposed knowledge editing approaches on the CounterFact dataset (Meng et al., 2022a), designed for testing language models' handling of counterfactual edits. It includes 21,919 records to assess whether models can learn new facts rather than simply memorizing the target words. Following other works (Meng et al., 2022a; Zheng et al., 2023), we used the first 2000 samples of this dataset and report the average over single fact editing results for Larimar-1.3B and Larimar-6B in Table 2. The performance scores for baselines are from (Meng et al., 2022a; Zheng et al., 2023) (see Related Work and Appendix for details on baseline methods). As opposed to training the LLM on edits, or causally tracing the original fact within the LLM and updating the relevant parameters to reflect the edit, we leverage Larimar's one-shot memory update for editing: the memory posterior is updated as the edit(s) of interest are written, and then the updated memory is queried. The read-out from the memory then conditions the decoder to output the edit.

Editor | Edit Success S | Edit Success M | Paraphrase S | Paraphrase M | Neighborhood S | Neighborhood M
GPT-2 XL | 22.2 | -4.8 | 24.7 | -5.0 | 78.1 | 5.0
FT* | 100.0 | 98.8 | 87.9 | 46.6 | 40.4 | -6.2
FT+L* | 99.1 | 91.5 | 48.7 | 28.9 | 70.3 | 3.5
KN | 28.7 | -3.4 | 28.0 | -3.3 | 72.9 | 3.7
KE | 84.3 | 33.9 | 75.4 | 14.6 | 30.9 | -11.0
KE-CF | 99.9 | 97.0 | 95.8 | 59.2 | 6.9 | -63.2
MEND | 99.1 | 70.9 | 65.4 | 12.2 | 37.9 | -11.6
MEND-CF | 100.0 | 99.2 | 97.0 | 65.6 | 5.5 | -69.9
ROME | 100.0 | 97.9 | 96.4 | 62.7 | 75.4 | 4.2
Larimar-1.3B | 100.0 | 99.8 | 41.7 | 0.4 | 74.7 | 1.6
GPT-J | 16.3 | -7.2 | 18.6 | -7.4 | 83.0 | 7.3
FT | 100.0 | 99.9 | 96.6 | 71.0 | 10.3 | -50.7
FT+L | 99.6 | 95.0 | 47.9 | 30.4 | 78.6 | 6.8
MEND | 97.4 | 71.5 | 53.6 | 11.0 | 53.9 | -6.0
ROME | 99.9 | 99.4 | 99.1 | 74.1 | 78.9 | 5.2
PROMPT | 99.7 | 80.9 | 91.0 | 32.9 | 37.9 | -2.8
IKE (w/ 32 demonstrations) | 100.0 | 91.7 | 95.2 | 64.5 | 77.0 | 35.2
IKE (w/o paraphrases) | 100.0 | – | 73.8 | – | 83.4 | –
IKE (w/o neighbors) | 100.0 | – | 99.8 | – | 11.5 | –
Larimar-6B | 99.6 | 96.0 | 76.5 | 22.4 | 80.2 | 3.9

Table 2. Single fact edit performance on the CounterFact dataset, comparing Larimar against baselines. Top two best systems are highlighted. Unlike other methods, Larimar uses dynamic memory updates with memory-conditioned decoding, and does not require gradient updates on edit samples, as opposed to methods that require training (FT, FT+L, MEND) or tracing plus decoder updating (ROME) on edit samples, or in-context demonstrations (IKE) of (paraphrased) edits and neighboring samples retrieved from a corpus.

The evaluation metrics used in Table 2 are as follows. Edit Success is the percent of cases where the edited fact (s, r, o*), (subject, relation, object) with modified object, has higher probability than the one based on the original object (s, r, o_c). Specifically, column S measures the percentage of P[o*] > P[o_c] cases, while M is the average of P[o*] − P[o_c] in the logits space of the language model. Paraphrase measures the same performance on (s, r, o*) but using paraphrased prompts. Neighborhood evaluates the model's ability to retain knowledge about the original object but in the context of neighboring subjects s': (s', r, o_c). Here the column S reflects the percentage of cases where P[o_c] > P[o*], while M is the average P[o_c] − P[o*].

As can be seen, when compared to existing editing baselines, Larimar achieves comparable performance in successfully editing new facts and in the ability to handle neighborhood prompts (on par with ROME when based on GPT-2 XL, and better when based on GPT-J), while there remains room to improve generalization. When compared to existing in-context editing approaches (the PROMPT and IKE baselines) (Zheng et al., 2023), Larimar does not need multiple in-context demonstrations of the edit and its paraphrases, or of neighboring facts, retrieved from a corpus and provided to the decoder. However, as we show in the Appendix (Table 8), when Larimar has access to one additional paraphrase per fact, by writing it in the memory, the generalization performance increases from 76.5 to 82.8. Note that in this setup the average number of added paraphrases per fact is one, and we queried the model with a paraphrased prompt unseen by the memory. Further, ablation experiments in the Appendix show that a scope detector, either trained on Larimar encodings or on encodings from an external LLM, helps with better paraphrase generalization and neighborhood specificity. Throughout the paper, Larimar is configured with a scope detector, unless otherwise mentioned. For details, see the Appendix.
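As a concrete reading of the S and M columns above, the sketch below computes both from per-example scores of the new object o* and the original object o_c (a minimal illustration with made-up numbers, not the benchmark code).

```python
import numpy as np

def edit_success(p_new, p_orig):
    """S: % of cases with P[o*] > P[o_c]; M: mean of P[o*] - P[o_c] (e.g., in logit space)."""
    p_new, p_orig = np.asarray(p_new), np.asarray(p_orig)
    S = 100.0 * np.mean(p_new > p_orig)
    M = float(np.mean(p_new - p_orig))
    return S, M

# Made-up scores for three edited facts.
print(edit_success([2.3, 1.1, 0.7], [0.4, 1.5, 0.2]))   # (66.67, 0.667)
```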
We also evaluated Larimar on the ZsRE benchmark (Levy et al., 2017), a QA dataset for relation extraction through reading comprehension, with results displayed in Table 12 in the Appendix. Performance scores for GPT-2 XL based baselines are cited from (Meng et al., 2022a), whereas the performance of ROME on GPT-J was independently estimated by us. Unlike the CounterFact evaluation, this assessment uses exact match counts for scoring, I[o* = argmax_o P[o]]. Compared to baselines, Larimar demonstrates effective editing and comparable neighborhood specificity on ZsRE, with slightly lower generalization, maintaining consistent results across GPT-2 and GPT-J decoders, underscoring its model-agnostic editing capabilities.

5.3. Sequential Fact Editing

We evaluated Larimar's ability to perform sequential editing, following the setup of (Hartvigsen et al., 2022), which tackles the issue of forgetting previous edits after multiple sequential edits. Hartvigsen et al. introduced a continual editing method that integrates an adaptor to update a codebook of edit key-value pairs with a pre-trained language model (GPT2-XL), showing memory retention during sequential edits. We adapt Larimar to this experimental setup, wherein a subset of 200 facts with 5 rephrasings each is selected from the ZsRE validation dataset for testing. In Larimar, a sequential edit is handled by updating the global memory through Eq. (4), again requiring no gradient-based update on incoming edits. For each edit, the encoding of the rephrased query concatenated with the corresponding answer is written to memory. We assessed Larimar's performance, compared to GRACE, using the edit retention rate (ERR), which is the mean F1 score after 1000 sequential edits when querying the memory with the encoded query Z_query for each written fact. Larimar is not finetuned on question-answer data; instead, we write each question-answer pair as a fact in the memory and query the memory with the original question. We report the Test Retention Rate (TRR) by evaluating the Larimar decoder's perplexity on 1000 random test samples from WikiText using a separate language model. In comparison, baseline models compute TRR from mean F1 scores on 1000 random samples of NQ data. Results show Larimar's ERR performance is comparable to GRACE's, while preserving its original test set performance. Notably, Larimar-1.3B achieves editing speeds approximately 10 or more times faster than GRACE on GPT-2 XL.

Editor | Test Retention Rate | Edit Retention Rate
MEND | 0.25 | 0.27
GRACE | 0.69 | 0.93
Larimar-1.3B | 14.6* | 0.97
Larimar-6B | 15.9* | 0.92

Table 3. Sequential editing on the ZsRE dataset, showing that Larimar does not forget older edits. *We report perplexity on WikiText estimated by a separate language model as the Test Retention Rate for Larimar, whereas mean F1 on the NQ test set is reported for MEND and GRACE on GPT2-XL (Hartvigsen et al., 2022).

We also evaluated Larimar's generalization to rephrased prompts, again comparing to GRACE. We use both (i) a dataset of 1000 ZsRE facts, each with 10 variations, divided into edit and holdout sets, and (ii) an edit/holdout dataset with more rephrasings and fewer (≈ 500) ZsRE facts. Our analysis, depicted in Figure 3, examines the mean F1 score on the holdout set against the number of memory writes using the edit set, compared to GRACE on the same datasets.² As Larimar has no knowledge of upcoming edits, it starts with near-zero F1; in contrast, GRACE has prior knowledge from training on the edit set. As the sequence of edits grows, Larimar surpasses GRACE's generalization performance at around 600 edits. In these experiments, we use K = 1000, setting the memory size proportional to the number of facts to be written. We also checked an alternative method (see Appendix E) for computing the reading and writing weights, which uses a Gaussian convolution to store each encoding z in the memory location(s) corresponding to the most similar content in a reference memory M^(ref), and which we found to perform better than the pseudoinverse method of (Pham et al., 2021) when there are a relatively small number of rephrasings per fact (see Appendix, Figure 5).

Figure 3. Mean F1 score on a held-out set of unseen rephrasings from ZsRE over a sequence of 3000 edits, showing Larimar's better generalization over GRACE on two datasets with 1000 or 511 independent facts (10 and ≈ 20 rephrasings per fact, respectively).

²We use GRACE with ε_init = 3.0 to edit block 4 of T5 (Hartvigsen et al., 2022).
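The ERR and holdout scores above are mean F1 over queried answers; the following is a standard token-level F1 formulation, shown for reference and assumed rather than taken from the paper.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the united states", "united states"))  # 0.8
```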
Figure 2. Batch editing accuracy on the CounterFact dataset. Baseline performances are taken from (Meng et al., 2023). Green: MEMIT, Orange: ROME, Magenta: MEND, Black: Larimar-6B.

5.4. Selective Forgetting

The goal of this section is to check if a specific fact can be selectively erased from N facts that have been written in Larimar's memory. We first checked if many edits can be written at once to memory and accurately retrieved from it. Figure 2 shows that the rewrite accuracy is near 100% for up to 512 edits (the memory size K) and then drops to 82% for 1024 edits. This result shows Larimar's ability to compress more than K facts into its size-K memory. This performance level is higher when compared to baselines like MEND and ROME, but subpar compared to MEMIT (Meng et al., 2023), which can accurately handle a very large batch of edits at the cost of reduced editing speed (see Table 7) and is also not meant to handle sequential editing. Note that Larimar's recall matches MEMIT for N < K facts, and K can be chosen as needed during inference.

To test Larimar's ability to selectively forget specified facts during inference, we first write N facts to memory (α_i = 1 in Eq. (4)), then forget one fact (α_i = −1), and also write to memory in its place (with α_i = 1) the same fact with the answer replaced by the string "unknown." We compare recall for the forgotten fact before and after the forgetting operation. To demonstrate that forgetting does not compromise other memories, we also report the recall on the remaining N − 1 facts in memory. The samples used are from the ZsRE validation set and from the CounterFact test set. Table 4 reports these results, comparing to a k-shot in-context learning baseline with Llama2-13B (see Appendix for an example prompt), and showing that Larimar can selectively forget using the memory updating mechanism, while retaining the remaining knowledge, whereas in-context learning struggles.

Model | CounterFact Forgotten | CounterFact Retained | ZsRE Forgotten | ZsRE Retained
Llama2 13B, N = 20, 6-shot | 0.75 | 0.77 | 0.68 | 0.73
Larimar 1.3B, N = 1 | 0.0 | – | 0.0 | –
Larimar 1.3B, N = K | 0.001 | 0.997 | 0.02 | 0.95
Larimar 1.3B, N = 2K | 0.02 | 0.79 | 0.03 | 0.52
Larimar 6B, N = 1 | 0.0 | – | 0.0 | –
Larimar 6B, N = K | 0.0 | 0.993 | 0.03 | 0.86
Larimar 6B, N = 2K | 0.03 | 0.71 | 0.04 | 0.50

Table 4. Fraction of facts with accurate recall, for the CounterFact and ZsRE datasets, after writing N facts to memory and removing one. "Forgotten" and "Retained" indicate, respectively, recall of the fact to which forgetting was applied, and mean recall of the N − 1 retained facts. K = 512 in all cases.

We also evaluate Larimar's ability to prevent generation of specific information by writing an empty (i.e., censored) response for the corresponding prompt to the memory. The baselines we consider are ROME and MEMIT, which were adapted to delete information in (Patil et al., 2023). Specifically, the authors updated the decoder d with an empty-response objective for a given string x that is known to the decoder and is aimed to be deleted, such that the probability of an "empty" target string E is maximized, argmax_d P[E|x, d]. To probe the effectiveness of information deletion, they used a blackbox input rephrasing attack, where the presence of the information of interest was checked in a number of model outputs as the model was prompted with different input rephrases. For Larimar, a single input prompt followed by "unknown" (= empty response) is written to the memory during inference to prevent leakage of the answer in the decoded outputs. The attack is considered successful on a given input prompt if the answer is found within a fixed number of model generations obtained using prompt rephrases. Samples from CounterFact known to GPT-J 6B and Larimar's decoder were used for this experiment. We used 5 sample responses for each of 4 paraphrases per fact (total attack budget 20), which were generated as prescribed in (Patil et al., 2023). Table 5 shows the results, suggesting that writing to Larimar's memory is more effective than direct model editing methods for preventing answer leakage for a single input prompt (17.6% attack success for Larimar, vs. 29% and 49% for ROME and MEMIT, respectively). Larimar can further be used to restrict the response for a batch of facts in one shot; the robustness to rephrase attacks remains higher than baselines.

Method | Attack Success (%)
Larimar (single) | 17.6
Larimar (batch) | 21.5
ROME (single) | 29.0
MEMIT (single) | 49.3

Table 5. Input rephrasing attack success for a budget of 20 on CounterFact samples. Writing [prompt + "unknown"] to Larimar-6B's memory in single fact or batch mode is more effective in preventing answer leakage in generations, when compared to updating the decoder with an empty model response objective for a single fact via direct model editing methods on GPT-J 6B.
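The write-forget-overwrite procedure above relies only on the sequential update of Eqs. (3)-(4). The following is a minimal sketch of that procedure (not the authors' code): random vectors stand in for encoder outputs, and the reference-memory variant W_i = Z_i (M^(ref))^† is used so that the original writing key can be recomputed at forgetting time.

```python
import numpy as np

def update(M, C, Z, W, alpha):
    """Sequential memory update (Eqs. 3-4): alpha=+1 writes, alpha=-1 forgets."""
    C = C + alpha * (W.T @ W)
    M = M + alpha * np.linalg.pinv(C) @ W.T @ (Z - W @ M)   # pinv stands in for C^{-1}
    return M, C

K, D = 512, 768
rng = np.random.default_rng(0)
M_ref = rng.normal(size=(K, D))                    # fixed reference memory M^(ref)
M_ref_pinv = np.linalg.pinv(M_ref)
facts = rng.normal(size=(8, 1, D))                 # placeholder encodings of 8 facts

# Write all facts; starting from zeros, the first write reproduces Eq. (2).
M, C = np.zeros((K, D)), np.zeros((K, K))
for z in facts:
    M, C = update(M, C, z, z @ M_ref_pinv, alpha=+1.0)

# Forget fact 3 by recomputing its writing key against M^(ref) ...
z_f = facts[3]
M, C = update(M, C, z_f, z_f @ M_ref_pinv, alpha=-1.0)

# ... and write in its place the same prompt with answer "unknown" (placeholder encoding).
z_unknown = rng.normal(size=(1, D))
M, C = update(M, C, z_unknown, z_unknown @ M_ref_pinv, alpha=+1.0)
```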
5.5. Generalization to long input context

We perform fact recall with long contexts using data that is not present in the base decoder's pretraining corpus. For this purpose, we curated facts from CNN Fast Facts (CNN, 2023) for 2021, 2022 and 2023.

We divide the input text into T chunks, each within the range of Larimar's training context window, and store each of these chunks in a separate memory M_i, i = 1..T. Given a query, we address and read from each of these memories. The readouts from these memories then form the basis of a successor memory, which is then queried and read from again. This process is continued until the number of readouts in the final memory is similar to Larimar's training context window. This recursive search in the latent memory space, using readouts to construct new higher-level memories, allows long contexts to be processed with Larimar's memory trained on a relatively small episode length. The retrieved Z_r from the final successor memory is passed to the decoder to predict the response. It should be noted that memory hierarchy is also found in the hippocampus and is thought to be implicated in learning (Collin et al., 2015).

Table 6 shows that Larimar's recall performance does not degrade much with increasing input context length, even compared to some of the most competitive baseline LLMs trained with longer training contexts. We also compare with the Supersizing Transformer (Klett and Ahle, 2023), which is a memory-augmented model; however, it did not show competitive recall performance because it was not trained to perform memory-conditioned generation. Due to memory processing in the latent space, Larimar is also efficient in terms of the number of KV cache tokens computed compared to baseline methods. Our experiments on the 128-fact case show that the average time required by Larimar to read from memory is 0.36s, compared to 1.44s for the Mistral-7B base model.

Learning to copy from the context remains an important aspect underlying transformers' impressive language modeling and other abilities (Devlin et al., 2018; Raffel et al., 2020; Olsson et al., 2022). LLMs with non-attention-based architectures, such as state space models, often underperform transformers (Gu et al., 2022; Gu and Dao, 2023) in language modeling, which is at least partly attributed to an inability to copy from the context, as well as an inability to generalize to longer contexts, when compared to transformers (Jelassi et al., 2024). Those investigations have fueled research on hybrid architectures. The results presented here suggest that combining a hierarchical memory model with a generative pretrained transformer, as in Larimar, could be a promising path in that direction. The end-to-end training of the fixed-size latent memory with the decoder in Larimar adds an explicit state to the decoder; writing to this state helps control the decoding, thus allowing truthful copying from the context in a generalized manner. The memory control also provides real-time knowledge editing as well as information leakage prevention. Attending to the memory read-out while decoding uses O(1) memory to predict each token, providing memory and computational benefits.

6. Related work

Memory-augmented NNs. External memory augmented neural networks (MANNs) were already proposed in the pre-transformer era, with the aim of better learning long-term dependencies in input data (Weston et al., 2014; Graves et al., 2014; Miller et al., 2016), showing enhanced performance in generative tasks, language modeling, long-term planning, sample-efficient RL, etc. MANNs add a trainable slot-based memory to a recurrent neural net. An attention-based reading mechanism is typically used to compute a weighted average of memory contents. This mechanism is estimated from training data, and thus it remains unclear how it can generalize to new data. Alternatively, the Kanerva Machine (Wu et al., 2018a), inspired by Kanerva's sparse distributed memory model (Kanerva, 1988), views memory as a global latent variable in a generative model and aims to learn a memory-dependent data prior and learnable addresses. In this framework, memory update and read/write are considered as Bayesian inference, i.e., the posterior parameters are updated as new data arrives. The KM and its successors (Wu et al., 2018b; Ramapuram et al., 2022; Pham et al., 2021) show that these conditional generative memory models offer better performance on image reconstruction, denoising, and generation tasks compared to variational autoencoders (Kingma and Welling, 2013) and memory networks (Bornschein et al., 2017). However, to our knowledge this is the first report investigating how those models can be adapted to LLMs and aid in their knowledge updating.

Transformers struggle with accessing and updating long-term memory (Fan et al., 2021). Efforts to extend input context length for better performance encounter issues in integrating inherent model knowledge with external facts, and lack robustness (Li et al., 2022; Liu et al., 2023). Augmenting transformers with external, non-differentiable memory and k-nearest neighbor (kNN) attention has shown promise in improving language modeling by utilizing additional context (Grave et al., 2017; Khandelwal et al., 2019). However, kNN-augmented models face challenges in controlling memory during decoding, leading to difficulties in updating facts due to conflicts between encoded knowledge and real-time information (Li et al., 2022; Liu et al., 2023; Zhu et al., 2020).
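A minimal sketch of the recursive memory search described above is given below (an illustration under the stated description, not the released implementation): per-chunk memories are written with the one-shot write, each is read with the query, and the readouts are re-written into successor memories until a single episode-sized memory remains. The grouping size and the reuse of the prior memory M0 for every write are assumptions.

```python
import numpy as np

def write(Z, M0):
    """One-shot write (Algorithm 1): posterior memory for episode encodings Z."""
    W0 = Z @ np.linalg.pinv(M0)
    return np.linalg.pinv(W0) @ Z

def read(z_query, M):
    """Read a latent from memory M with query encoding z_query."""
    W = z_query @ np.linalg.pinv(M)
    return W @ M

def recursive_read(chunk_encodings, z_query, M0, episode_len=8):
    """Hierarchical search: read each per-chunk memory, then re-write the readouts
    into successor memories until one episode-sized memory remains."""
    readouts = [read(z_query, write(Z, M0)) for Z in chunk_encodings]
    while len(readouts) > episode_len:
        grouped = [np.vstack(readouts[i:i + episode_len])
                   for i in range(0, len(readouts), episode_len)]
        readouts = [read(z_query, write(Z, M0)) for Z in grouped]
    Z_r = read(z_query, write(np.vstack(readouts), M0))
    return Z_r   # passed to the decoder for memory-conditioned generation

C, K = 768, 512
rng = np.random.default_rng(0)
M0 = rng.normal(size=(K, C))
chunks = [rng.normal(size=(8, C)) for _ in range(32)]   # placeholder chunk encodings
print(recursive_read(chunks, rng.normal(size=(1, C)), M0).shape)   # (1, 768)
```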
Method | Train Context | n_fact = 64 | n_fact = 96 | n_fact = 128 | n_fact = 256
mistral-7b (3-shot) | 8192 | 0.98 / 2655 | 0.96 / 3495 | 0.57 / 4334 | 0.42 / 7417
gpt-neox-20b (3-shot) | 2048 | 0.52 / 2366 | 0.36 / 3193 | 0.33 / 4020 | 0.35 / 7231
llama2-13b (3-shot) | 4096 | 0.97 / 2755 | 0.66 / 3628 | OOM | OOM
Supersizing Transformer | 2048 | 0.39 / 1462 | 0.39 / 2249 | 0.37 / 3072 | 0.37 / 6201
Supersizing Transformer + filtering | 2048 | 0.72 / 1640 | 0.71 / 2375 | 0.70 / 3110 | 0.69 / 5809
Larimar-1.3b | 384/2048 | 0.89 / 1565 | 0.88 / 2276 | 0.88 / 2988 | 0.86 / 5607
Larimar-6b | 384/2048 | 0.82 / 1565 | 0.81 / 2276 | 0.81 / 2988 | 0.80 / 5607

Table 6. Novel fact addition recall rate on FastFacts. Larimar shows good recall performance and can extrapolate to context lengths longer than it was trained on. Baseline models show good recall on small contexts, but recall degrades significantly for longer contexts.

Model Editing. For comprehensive surveys of editing approaches, see (Yao et al., 2023; Zhang et al., 2024; Wang et al., 2023b). Editing methods can be broadly categorized into three categories: 'Recognition Phase', 'Association Phase' and 'Mastery Phase' (Zhang et al., 2024). The 'recognition phase'-targeting methods consider demonstrating the right context to help the LLM output correct facts, either via in-context demonstrations of similar examples (Zheng et al., 2023), or by training an external model on edits (Mitchell et al., 2022). The 'association phase'-related editing methods consider merging new knowledge with that of the base LLM, either by patching (adding and training) error-specific neurons (Huang et al., 2023), or by adding an adaptor storing edit key-value pairs to a specific LLM layer (Hartvigsen et al., 2022). The 'mastery phase' methods learn to update the base LLM's own parameters. Examples are regularized finetuning (Zhu et al., 2020) and hypernetwork-based methods (Mitchell et al., 2021; De Cao et al., 2021). Recent works also explore the 'locate-then-edit' approach: (Meng et al., 2022a;b) first perform a causal tracing to detect which part of the hidden states can be attributed to the fact, and then do a rank-one update of the corresponding weight parameters to directly write in the updated fact.

Current model editing approaches, while promising (Yao et al., 2023), face significant limitations, such as high training costs and difficulties in generalizing to new data. These methods often cannot efficiently update Large Language Models (LLMs) due to extensive time and memory requirements (Mitchell et al., 2022). Furthermore, the assumption that knowledge within LLMs is localized has been challenged (Hase et al., 2023), indicating that simple parameter updates may not be effective for comprehensive edits. The performance of LLMs degrades with multiple edits, leading to issues like knowledge forgetting and distortion (Mitchell et al., 2022; Meng et al., 2023; Gupta et al., 2024; Li et al., 2023; Gu et al., 2024). Alternatives like external cache or memory-based editing have been proposed to circumvent direct model modifications, yet challenges in selectively forgetting outdated or sensitive knowledge persist (Ishibashi and Shimodaira, 2023; Patil et al., 2023).

Different from the above-mentioned works, we present a novel approach to augment Large Language Models (LLMs) with generative memory, enabling dynamic editing and adaptation without retraining. This differs from traditional methods that update LLM parameters (Meng et al., 2022a;b) or external memories (Han et al., 2023; Hartvigsen et al., 2022), and does not require multiple demonstrations for control (Zheng et al., 2023).

Larimar's forgetting operation does not use negative examples to fine-tune LLMs for unlearning (Yu et al., 2023). Nor does Larimar require tailored fine-tuning (Eldan and Russinovich, 2023) or the insertion of extra layers (Chen and Yang, 2023); it is complementary to in-context unlearning approaches such as (Pawelczyk et al., 2023) for fact forgetting.

7. Conclusions

In this work, we propose augmenting LLMs with a dynamically updatable and distributed episodic memory as a means of online knowledge adaptation. By exploiting a one-shot memory update mechanism, combined with memory-conditioned decoding, the proposed framework shows accurate, precise, robust, and significantly faster editing performance compared to baselines in single-fact as well as the challenging sequential editing experiments. We exploit the same memory updating mechanism to enable a fast and selective fact forgetting operation, as well as an effective information deletion mechanism. We also provide a simple approach for handling long input contexts by recursively reading from Larimar's memory space, revealing better fact recall from long input contexts by Larimar when compared to state-of-the-art LLMs trained with a much larger training context window. The proposed framework thus provides a simple, general, and principled approach to update LLMs in real time by coupling them with an adaptable episodic memory control.
8. Broader Impact and Ethical Considerations

This paper presents work whose goal is to advance the field of machine learning and large language models. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551, 2020.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463-2473, 2019.

Cem Anil, Yuhuai Wu, Anders Johan Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Venkatesh Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. In Advances in Neural Information Processing Systems, 2022. URL [Link]id=zSkYVeX7bC4.

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers, 2023.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. Modifying memories in transformer models, 2020.

Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. Large language models with controllable working memory, 2022.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023.

Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks, 2023.

William P Dempsey, Zhuowei Du, Anna Nadtochiy, Colton D Smith, Karl Czajkowski, Andrey Andreev, Drew N Robson, Jennifer M Li, Serina Applebaum, Thai V Truong, et al. Regional synapse gain and loss accompany memory formation in larval zebrafish. Proceedings of the National Academy of Sciences, 119(3):e2107661119, 2022.

Livia Autore, James D O'Leary, Clara Ortega-de San Luis, and Tomás J Ryan. Adaptive expression of engrams by retroactive interference. bioRxiv, pages 2023-03, 2023.

Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control, 2016.

Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. Advances in Neural Information Processing Systems, 20, 2007.

Dharshan Kumaran, Demis Hassabis, and James L McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512-534, 2016.

Weinan Sun, Madhu Advani, Nelson Spruston, Andrew Saxe, and James E Fitzgerald. Organizing memories for generalization in complementary learning systems. Nature Neuroscience, 26(8):1438-1448, 2023.

Steve Ramirez, Xu Liu, Pei-Ann Lin, Junghyup Suh, Michele Pignatelli, Roger L Redondo, Tomás J Ryan, and Susumu Tonegawa. Creating a false memory in the hippocampus. Science, 341(6144):387-391, 2013.
Yan Wu, Greg Wayne, Alex Graves, and Timothy Lillicrap. The Kanerva machine: A generative distributed memory, 2018a.

Kha Pham, Hung Le, Man Ngo, Truyen Tran, Bao Ho, and Svetha Venkatesh. Generative pseudo-inverse memory. In International Conference on Learning Representations, 2021.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359-17372, 2022a.

Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with GRACE: Lifelong model editing with discrete key-value adaptors. arXiv preprint arXiv:2211.11031, 2022.

Peng Wang, Ningyu Zhang, Xin Xie, Yunzhi Yao, Bozhong Tian, Mengru Wang, Zekun Xi, Siyuan Cheng, Kangwei Liu, Guozhou Zheng, and Huajun Chen. EasyEdit: An easy-to-use knowledge editing framework for large language models, 2023a.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.

Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2023.

Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory-based model editing at scale. In International Conference on Machine Learning, Vol. 162. JMLR, 2022.

Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172, 2023.

Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. Can we edit factual knowledge by in-context learning?, 2023.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115, 2017.

CNN. 2023 in review fast facts, 2023. URL: [Link]2023-in-review-fast-facts/[Link].

Silvy HP Collin, Branka Milivojevic, and Christian F Doeller. Memory hierarchies map onto the hippocampal long axis in humans. Nature Neuroscience, 18(11):1562-1564, 2015.

Phoebe Klett and Thomas Ahle. Supersizing transformers: Going beyond RAG with extended minds for LLMs. The Normal Blog, 2023. URL: https://[Link]/posts/2023-09-12-supersizing-transformers/[Link].

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads, 2022.

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022.

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023.

Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying, 2024.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents, 2016.

Pentti Kanerva. Sparse distributed memory. MIT Press, 1988.

Yan Wu, Greg Wayne, Karol Gregor, and Timothy Lillicrap. Learning attractor dynamics for generative memory, 2018b.
Larimar: Large Language Models with Episodic Memory Control

Jason Ramapuram, Yan Wu, and Alexandros Kalousis. Akshat Gupta, Anurag Rao, and Gopala Anumanchipalli.
Kanerva++: extending the kanerva machine with differ- Model editing at scale leads to gradual and catastrophic
entiable, locally block allocated latent memory, 2022. forgetting. arXiv preprint arXiv:2401.07453, 2024.

Jörg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo J. Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang,
Rezende. Variational memory addressing in generative Xi Chen, and Huajun Chen. Unveiling the pitfalls of
models, 2017. knowledge editing for large language models, 2023.

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-
Joulin, and Sainbayar Sukhbaatar. Addressing some lim- Hua Ling, Kai-Wei Chang, and Nanyun Peng. Model
itations of transformers with feedback memory, 2021. editing can hurt general abilities of large language mod-
els, 2024.
Edouard Grave, Moustapha Cisse, and Armand Joulin. Un-
bounded cache model for online language modeling with Yoichi Ishibashi and Hidetoshi Shimodaira. Knowledge
open vocabulary, 2017. sanitization of large language models, 2023.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Xiaoqi Han, Ru Li, Hongye Tan, Wang Yuanlong,
Zettlemoyer, and Mike Lewis. Generalization through Qinghua Chai, and Jeff Pan. Improving sequen-
memorization: Nearest neighbor language models. tial model editing with fact retrieval. In Houda
arXiv preprint arXiv:1911.00172, 2019. Bouamor, Juan Pino, and Kalika Bali, editors, Find-
ings of the Association for Computational Linguis-
Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, tics: EMNLP 2023, pages 11209–11224, Singa-
Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, pore, December 2023. Association for Computational
Jintian Zhang, Yuansheng Ni, et al. A comprehensive Linguistics. doi: 10.18653/v1/[Link]-emnlp.
study of knowledge editing for large language models. 749. URL [Link]
arXiv preprint arXiv:2401.01286, 2024. findings-emnlp.749.

Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Charles Yu, Sullam Jeoung, Anish Kasi, Pengfei Yu, and
Chen Chen, and Jundong Li. Knowledge editing for Heng Ji. Unlearning bias in language models by par-
large language models: A survey, 2023b. titioning gradients. In Findings of the Association for
Computational Linguistics: ACL 2023, pages 6032–
Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, 6048, 2023.
Wenge Rong, and Zhang Xiong. Transformer-patcher:
One mistake worth one neuron. In The Eleventh In- Ronen Eldan and Mark Russinovich. Who’s harry potter?
ternational Conference on Learning Representations, approximate unlearning in llms, 2023.
2023. URL [Link] Jiaao Chen and Diyi Yang. Unlearn what you want to
id=4oYUGeGBPm. forget: Efficient unlearning for llms. arXiv preprint
arXiv:2310.20150, 2023.
Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea
Finn, and Christopher D Manning. Fast model editing Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju.
at scale. arXiv preprint arXiv:2110.11309, 2021. In-context unlearning: Language models as few shot un-
learners. arXiv preprint arXiv:2310.07579, 2023.
Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing
factual knowledge in language models. arXiv preprint Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao
arXiv:2104.08164, 2021. Chang, and Furu Wei. Knowledge neurons in pretrained
transformers. arXiv preprint arXiv:2104.08696, 2021.
Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan
Belinkov, and David Bau. Mass-editing memory in a
transformer. arXiv preprint arXiv:2210.07229, 2022b.

Peter Hase, Mohit Bansal, Been Kim, and Asma Ghan-


deharioun. Does localization inform editing? surpris-
ing differences in causality-based localization vs. knowl-
edge editing in language models. In Thirty-seventh
Conference on Neural Information Processing Systems,
2023. URL [Link]
id=EldbUlZtbd.


A. Baselines

FT  Fine-Tuning (FT) uses Adam optimization with early stopping, adjusting the mlp_proj weights in one layer to optimize the training loss.

FT+L  Constrained fine-tuning (FT+L), as in (Zhu et al., 2020): the authors apply an L-infinity norm constraint by clamping the weights so that they do not exceed an epsilon range at each gradient step. They chose layer 0 and epsilon = 5e-4 for GPT-2, and epsilon = 5e-5 for GPT-J.

KN  A method by (Dai et al., 2021) which selects neurons that are associated with knowledge expression via gradient-based attributions, and then modifies the MLP at the rows corresponding to those neurons by adding scaled embedding vectors.

KE  Knowledge editor (KE) (De Cao et al., 2021) learns an LSTM sequence model that uses gradient information to predict rank-1 weight changes to the model. KE-CF / KE-ZsRE is the model additionally trained on the training set of CounterFact / ZsRE.

MEND  Model Editor Networks with Gradient Decomposition (MEND) (Mitchell et al., 2021) learns a rank-1 decomposition of the negative log-likelihood gradient with respect to some subset of parameters. Similarly, MEND-CF / MEND-ZsRE is the model additionally trained on the training set of CounterFact / ZsRE.

ROME  Rank-One Model Editing (ROME), proposed by (Meng et al., 2022a), treats the MLP module as a key-value store. To add a new key-value pair, ROME applies a rank-one modification to the weights of the MLP, adding the new information directly.

IKE  In-context Knowledge Editing (IKE) (Zheng et al., 2023) defines three types of demonstration formatting templates, including copy, update, and retain, which guide the model to edit knowledge facts by in-context learning (ICL). The parameters of the model are not updated.

PROMPT  Similar to IKE (Zheng et al., 2023), but simply prepends the new fact to the LLM prompt. The parameters of the model are also not updated.

MEMIT  MEMIT performs direct model editing via fact tracing followed by parameter editing. It is an expanded version of ROME which enables the editing of large amounts of factual data by updating a sequence of MLP layers.

ICL  To compare to In-Context Learning (ICL) as a baseline method in Table 4, we use a prompt which consists of N facts, half of which are marked with a prefix string (e.g., "[UNKNOWN]"), followed by K examples of questions and answers prior to a final query to the model; for the half of these examples corresponding to facts marked with the prefix string, the prefix string replaces the answer, indicating that the fact should be treated as forgotten.

Editor       Wall clock time (CounterFact)
FT           35.94 s
SERAC        5.31 s
CaliNet      1.88 s
T-Patcher    1864.74 s
KE           2.20 s
MEND         0.51 s
KN           225.43 s
ROME         147.2 s
MEMIT        143.2 s

Table 7. Wall clock time for each editing method to perform 10 edits from the CounterFact benchmark, as reported in (Yao et al., 2023).

A.1. Prompt Example for Forgetting with ICL Experiments

Below, we show an example ICL prompt for N = 20 and K = 6, used for Llama2-13B in Table 4:

Facts:
Gaston Palewski writes in French
The domain of work of Hermann Klaatsch is anatomy
[UNKNOWN] 2 Minute Drill is to debut on ESPN
[UNKNOWN] Immanuel Wallerstein works in the area of sociology
[UNKNOWN] Nissan Skyline, created by Nissan
Feng Fei-fei, a citizen of Taiwan
Michie Mee works as actor
[UNKNOWN] Antonio Palomino's life ended in Madrid
[UNKNOWN] Film Forum is within Manhattan
[UNKNOWN] Phoenix Sky Harbor International Airport is located in Phoenix
Abraham & Straus formed in Brooklyn
The domain of activity of Joseph Schumpeter is economics
George Buza's profession is an actor
[UNKNOWN] Velayudham originated in India
Sophie Calle, a native French
In Nunavut, the language spoken is French
[UNKNOWN] The Wide Country is to debut on NBC
[UNKNOWN] The mother tongue of Porfiry Ivanov is Russian
In Kiiminki, they understand Finnish
[UNKNOWN] Pachomius the Great succumbed at Egypt

Input: The domain of work of Hermann Klaatsch is
Output: anatomy
Input: 2 Minute Drill is to debut on
Output: UNKNOWN
Input: Immanuel Wallerstein works in the area of
Output: UNKNOWN
Input: Nissan Skyline, created by
Output: UNKNOWN
Input: Feng Fei-fei, a citizen of
Output: Taiwan
Input: Michie Mee works as
Output: actor
Input: Gaston Palewski writes in
Output:
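To make the construction of these prompts concrete, the following is a minimal sketch, under our own assumptions about data structures and naming (the Fact record and the build_icl_forgetting_prompt helper are hypothetical, not the authors' code), of how a forgetting prompt of this form could be assembled.

```python
# Minimal sketch (not the authors' implementation) of assembling an ICL
# forgetting prompt: N facts, half prefixed with a marker string, followed by
# K question/answer demonstrations and a final query left for the model.
import random
from dataclasses import dataclass

UNKNOWN_PREFIX = "[UNKNOWN]"  # marks facts that should be treated as forgotten

@dataclass
class Fact:
    statement: str   # e.g. "Gaston Palewski writes in French"
    prompt: str      # e.g. "Gaston Palewski writes in"
    answer: str      # e.g. "French"
    forgotten: bool  # True if the fact is marked with the prefix in the Facts block

def build_icl_forgetting_prompt(facts, k, query_fact, seed=0):
    """Assemble the Facts block, k Input/Output demonstrations, and a final query."""
    rng = random.Random(seed)
    lines = ["Facts:"]
    for f in facts:
        lines.append(f"{UNKNOWN_PREFIX} {f.statement}" if f.forgotten else f.statement)
    lines.append("")
    # Half of the demonstrations come from forgotten facts (answered with UNKNOWN),
    # half from retained facts (answered with the ground truth).
    forgotten = [f for f in facts if f.forgotten]
    retained = [f for f in facts if not f.forgotten]
    demos = rng.sample(forgotten, k // 2) + rng.sample(retained, k - k // 2)
    rng.shuffle(demos)
    for f in demos:
        lines.append(f"Input: {f.prompt}")
        lines.append(f"Output: {'UNKNOWN' if f.forgotten else f.answer}")
    # Final query, to be completed by the model.
    lines.append(f"Input: {query_fact.prompt}")
    lines.append("Output:")
    return "\n".join(lines)
```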


B. Ablation Results on the CounterFact Dataset

In Table 8 we show that when Larimar has access to additional fact paraphrases, its paraphrase performance increases from 76.5 to 82.8. Note that in this setup the average number of added paraphrased facts is one, and we queried the model with paraphrased prompts unseen by the memory. Also, observe that the use of the scope detector for query detection is crucial for the model to properly handle the neighborhood prompts.

Editor                        Edit Success   Paraphrase   Neighborhood
Larimar-6B w/ scope           99.6           76.5         80.2
Larimar-6B +para              99.6           82.8         80.6
Larimar-6B +para, no scope    99.6           88.7         16.3

Table 8. Single fact edit evaluation on the CounterFact dataset. Larimar-6B base is the baseline which includes only a single fact in the memory and uses the in-scope query detector. Larimar-6B +para is the version which adds on average one additional paraphrased fact into the memory.

In Tables 9 and 10 we provide ablation results on Larimar, varying different learning parameters and architectural components of the model and observing performance on the CounterFact dataset. In Table 9, the ablation results for the GPT-2 XL based model are presented. Here we examined three different training configurations:

• C1: Episode length 6, observation noise 0.0001, trained for 2 epochs
• C2: Episode length 20, observation noise 0.000001, trained for 4 epochs
• C3: Episode length 16, observation noise 0.000001, trained for 2 epochs

Note that the model reported in Table 12 in the main paper is based on configuration C3. Moreover, we looked at three versions of the Larimar architecture: the original Larimar, Larimar without the scope detector, and Larimar without memory.

Config   Editor       Edit Success (S / M)   Paraphrase (S / M)   Neighborhood (S / M)
C1       Larimar      100.0 / 99.7           38.4 / -2.9          74.2 / 1.6
         No Scope     100.0 / 99.8           37.8 / -3.0          22.4 / -34.1
         No Memory    23.3 / -4.4            26.5 / -3.5          77.7 / 4.7
C2       Larimar      100.0 / 99.9           35.2 / -3.5          75.4 / 2.0
         No Scope     100.0 / 99.9           33.1 / -3.6          26.2 / -36.2
         No Memory    20.6 / -4.9            24.5 / -4.1          78.9 / 5.4
C3       Larimar      100.0 / 99.8           41.9 / 0.4           74.8 / 1.6
         No Scope     100.0 / 99.9           41.1 / 0.4           14.3 / -45.8
         No Memory    21.6 / -4.8            25.4 / -3.8          78.4 / 5.0

Table 9. Ablation results for Larimar-1.3B using the CounterFact dataset.

As can be seen, configuration C3 had some edge in performance. The effect of removing the scope detector is reflected in a drop of the neighborhood score. This is expected, since the model now reroutes prompts from the unconstrained decoder to the memory-constrained one, where the memory influence makes it harder to cover prompts unrelated to in-memory content. On the other hand, removing the memory module results in a significant decrease in edit success and paraphrasing, as the model then has no knowledge of the introduced facts; at the same time its general language abilities remain intact, as reflected in the high neighborhood score.

In Table 10, the ablation results for the GPT-J based model are reported for the following five training configurations:

• C1: Episode length 5, no KL loss, trained for 5 epochs
• C2: Episode length 16, noise level 1e-4, trained for 8 epochs
• C3: Episode length 16, noise level 1e-4, no KL loss, trained for 8 epochs
• C4: Episode length 8, noise level 1e-4, trained for 8 epochs
• C5: Episode length 8, noise level 1e-4, no KL loss, trained for 8 epochs

Note that the model reported in Table 2 in the main paper is based on configuration C1. As before, we looked at architectural changes which included the removal of the scope detector and the memory block. We observed that configuration C2 performed the worst, while C1 performed better overall. Moreover, the experiments again confirmed the benefit of the scope detector and the effect of the memory unit.


Config   Editor       Edit Success (S / M)   Paraphrase (S / M)   Neighborhood (S / M)
C1       Larimar      99.6 / 96.0            76.3 / 22.1          80.2 / 3.9
         No Scope     99.6 / 96.1            83.6 / 23.6          10.4 / -32.8
         No Memory    15.8 / -6.8            18.6 / -6.8          83.6 / 6.9
C2       Larimar      42.4 / 3.4             37.8 / -2.7          82.9 / 6.9
         No Scope     42.5 / 3.4             38.8 / -2.7          67.8 / 5.7
         No Memory    15.2 / -7.0            18.3 / -6.3          83.2 / 6.9
C3       Larimar      99.9 / 98.9            68.0 / 10.8          79.9 / 3.1
         No Scope     99.9 / 99.0            81.3 / 11.9          7.2 / -48.1
         No Memory    15.0 / -6.6            18.5 / -6.2          83.6 / 6.5
C4       Larimar      91.1 / 70.8            66.1 / 16.1          81.6 / 5.9
         No Scope     91.0 / 70.8            67.9 / 16.3          27.8 / -5.4
         No Memory    15.4 / -6.9            18.1 / -6.0          83.2 / 6.6
C5       Larimar      99.9 / 98.9            72.2 / 15.1          79.7 / 3.3
         No Scope     99.9 / 99.0            82.9 / 16.5          6.6 / -50.4
         No Memory    14.3 / -6.9            18.8 / -6.3          83.5 / 6.8

Table 10. Ablation results for Larimar-6B using the CounterFact dataset.

Editor              Edit Success (S / M)   Paraphrase (S / M)   Neighborhood (S / M)
Larimar (ESD)       99.6 / 96.0            76.3 / 22.1          80.2 / 3.9
Larimar (ISD-ep4)   99.8 / 89.8            83.5 / 23.5          82.1 / 6.1
Larimar (ISD-ep8)   99.9 / 92.0            82.9 / 16.5          81.3 / 5.3

Table 11. Ablation experiment on Larimar-6B using the CounterFact dataset with different scope detectors: external (ESD) vs. internal (ISD, trained on CounterFact data).

C. ZsRE Single Fact Editing Results

We evaluated Larimar on the ZsRE benchmark (Levy et al., 2017), a QA dataset for relation extraction through reading comprehension. See Table 12 for details. Larimar demonstrates effective editing and paraphrasing on ZsRE, with comparable or slightly lower performance in the Neighborhood category, and maintains consistent results across GPT-2 and GPT-J decoders, underscoring its model-agnostic editing capabilities.

Editor          Edit Success   Paraphrase   Neighborhood
GPT-2 XL        22.2           21.3         24.2
FT              99.6           82.1         23.2
FT+L            92.3           47.2         23.4
KE              65.5           61.4         24.9
KE-zsRE         92.4           90.0         23.8
MEND            75.9           65.3         24.1
MEND-zsRE       99.4           99.3         24.1
ROME            99.8           88.1         24.2
Larimar-1.3B    98.1           81.6         19.7

GPT-J           26.4           25.8         27.0
ROME            99.8           95.9         27.2
Larimar-6B      94.5           70.4         25.1

Table 12. Single fact edit evaluation on the ZsRE dataset. Larimar closely matches or outperforms gradient-based, locate-then-edit, and ICL baselines with training-free memory-conditioned generation.

D. Additional CounterFact Batch Editing Results

Figure 4 shows the generalization and neighborhood specificity comparison of Larimar with three baselines, MEMIT, ROME, and MEND. The results indicate that Larimar maintains the generalization performance of single fact editing up to a batch size of 512; for larger batches the performance drops. The neighborhood specificity of Larimar, thanks to the use of the scope detector, remains very high for all batch sizes.

Figure 4. Batch editing on the CounterFact dataset. Baseline performances are taken from (Meng et al., 2023). Green: MEMIT, Orange: ROME, Magenta: MEND, Black: Larimar-6B.

E. Additional Experimental Details

In several experiments, we compute both reading and writing weights using a Gaussian filter, as follows. Given an encoding z to be written to memory, and a reference memory matrix M^{(ref)}, we define the writing weight element w_k at memory slot k as

    w_k(z \mid M^{(\mathrm{ref})}) \;\propto\; \exp\!\left( - \frac{\lVert z - M^{(\mathrm{ref})}_{k,:} \rVert_2^2}{2 \alpha \, \sigma^2(z \mid M^{(\mathrm{ref})})} \right),    (7)

where "\propto" implies that we normalize the weight vectors such that \sum_{k=1}^{K} w_k = 1, \alpha is a parameter which controls the entropy or sparsity of the weights (w becomes a one-hot vector, i.e., a multinomial distribution with zero entropy, as \alpha \to 0), and we choose the width function \sigma(z \mid M^{(\mathrm{ref})}) to be the distance from z to the nearest-neighbor row in M^{(ref)},

    \sigma(z \mid M^{(\mathrm{ref})}) := \min_k \lVert z - M^{(\mathrm{ref})}_{k,:} \rVert_2.    (8)

Eq. (7) assigns a lower weight w_k to memory locations k for which the distance \lVert z - M^{(\mathrm{ref})}_{k,:} \rVert_2 is large compared to the nearest-neighbor distance \sigma(z \mid M^{(\mathrm{ref})}).
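To make Eqs. (7) and (8) concrete, here is a minimal NumPy sketch of the Gaussian weight computation, assuming a K x d reference memory and a d-dimensional encoding; the function name and the small numerical-stability constant are our own additions, not part of Larimar's released code.

```python
# Minimal NumPy sketch of the Gaussian reading/writing weights of Eqs. (7)-(8).
# Assumes M_ref has shape (K, d) and z has shape (d,); names are illustrative.
import numpy as np

def gaussian_weights(z, M_ref, alpha=1e-3):
    # Squared Euclidean distance from z to each row of the reference memory.
    d2 = np.sum((M_ref - z) ** 2, axis=1)              # ||z - M_ref[k, :]||^2
    # Width sigma: distance to the nearest-neighbor row, Eq. (8).
    sigma = np.sqrt(d2.min())
    # Unnormalized Gaussian weights, Eq. (7); small alpha pushes w toward one-hot.
    logits = -d2 / (2.0 * alpha * sigma**2 + 1e-12)     # epsilon avoids division by zero
    w = np.exp(logits - logits.max())                   # stabilized exponentiation
    return w / w.sum()                                  # normalize so the weights sum to 1

# Example usage: given a query encoding z_query and a memory matrix M (K x d),
# w = gaussian_weights(z_query, M_ref); z_read = w @ M gives the read-out vector.
```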


Sequential editing experiments. For the sequential editing experiments reported in Table 3 and Figure 3, we set K = 1000 and use a fixed reference memory M^{(ref)} (see Section 3) to compute reading and writing weights.

For Table 3, the reference memory is constructed by encoding the prompt for each of the 1000 edits and placing it in one row of M^{(ref)}.

For Figure 3, the reference memory is constructed by encoding the first prompt for each of the 1000 unique facts (among the several rephrasings in the edit set which are written to memory) and placing it in a single row of M^{(ref)}. Thus, when querying memory with an encoded rephrased prompt z in Eq. (7), if z is closest to the row k in M^{(ref)} corresponding to the same fact, the key vector element w_k will be largest for this element and suppressed for other memory locations. (We use \alpha = 10^{-3} to strongly suppress more distant encodings in the reference memory. Empirically, we found that the nearest-neighbor encoding picked out by Eq. (7) with small \alpha is usually the encoded prompt for the same fact, with lower F1 scores occurring mainly in cases where the nearest-neighbor row in M^{(ref)} corresponds to a different fact.) We found that computing reading and writing weights as in (Pham et al., 2021), w = z(M^{(ref)})^{\dagger}, was not as effective with rephrased facts (Figure 3 and Table 13) unless the number of rephrasings per fact was relatively large.

When writing to memory, a trailing period is appended to the ground-truth label in order to reduce the likelihood of the model generating additional text. When evaluating the F1 score, we remove (in both target and predicted tokens) the token corresponding to a period (13). We also remove the token 198, which corresponds to the newline character '\n', when it is generated as the last token.

In Figure 5, we compare different variants of Larimar on the same task as shown in Figure 3. Relative to the Gaussian convolution method of Eq. (7), computing reading and writing weights with the reference memory matrix pseudoinverse, w = z(M^{(ref)})^{\dagger}, performed well on a dataset of 511 ZsRE facts with approximately 20 phrasings per fact, but significantly worse on a dataset of 1000 ZsRE facts with 10 phrasings per fact. (We hypothesize that Eq. (7) is more effective at finding a nearby rephrase encoding for the same fact when there are only one or a few paraphrases available in the data.)

[Figure 5 plot: Holdout (F1) vs. number of edits; curves: Gaussian with 511 facts, Gaussian with 1000 facts, Pseudoinverse with 511 facts, Pseudoinverse with 1000 facts.]

Figure 5. Mean F1 score of Larimar, comparing different choices for computing reading and writing weights, namely the Gaussian convolution in Eq. (7) and the pseudoinverse method of (Pham et al., 2021), on held-out sets of unseen rephrasings from ZsRE over a sequence of 3000 edits. (Black curves are shown in Figure 3 in the main text.)

In our fact forgetting experiments (Table 4), we used a simple reference memory in which each matrix element is sampled randomly, M^{(ref)}_{ij} \sim N(0, 1). We found this choice to be less effective when querying with rephrased prompts (in which case the additional structure of M^{(ref)} described above helps to locate the nearby encoding of a different phrasing of the same fact), but to be sufficient when querying with the same prompts used when writing to memory (as in Table 4). In this case we compute the writing weight using the encoding of the prompt of the fact written to memory, W = Z_prompt (M^{(ref)})^{\dagger} (instead of Eq. (7)), and compute the reading weight in the same way, with the reading prompt differing from the writing prompt in the rephrasing experiments.

Lastly, in our batch editing experiment (Figure 2), we computed writing weights using the encoded prompt, W = Z_prompt (M^{(ref)})^{\dagger}, and computed both writing and reading weights with M^{(ref)} set to the memory matrix obtained from Larimar's training (although we found a Gaussian random matrix to yield comparable results).

Throughout these experiments, we use \sigma_w = 0 and \xi = 0.
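For comparison with the Gaussian scheme of Eq. (7), the pseudoinverse-based weights of (Pham et al., 2021) referenced throughout this appendix, w = z(M^{(ref)})^{\dagger} and W = Z_prompt (M^{(ref)})^{\dagger}, can be sketched as follows; the shapes, helper names, and commented usage are illustrative assumptions rather than the authors' implementation.

```python
# Schematic sketch of the pseudoinverse reading/writing weights: w = z @ pinv(M_ref)
# for a single encoding, or W = Z @ pinv(M_ref) for a batch of prompt encodings.
# Shapes and names are illustrative.
import numpy as np

def build_reference_memory(prompt_encodings):
    # One row per unique fact: the encoded (first) prompt for that fact.
    return np.stack(prompt_encodings, axis=0)           # shape (K, d)

def pseudoinverse_weights(Z, M_ref):
    # Z: (N, d) batch of encodings, or (d,) for a single encoding; M_ref: (K, d).
    return np.atleast_2d(Z) @ np.linalg.pinv(M_ref)     # resulting weights: (N, K)

# Usage sketch for the sequential-editing setup described above (encode() is a
# placeholder for the text encoder):
# M_ref = build_reference_memory([encode(p) for p in unique_fact_prompts])
# W = pseudoinverse_weights(Z_prompt, M_ref)               # writing weights
# w = pseudoinverse_weights(encode(query_prompt), M_ref)   # reading weights
```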


F. Generalization via Rephrase-Augmented Memory

We also evaluate Larimar-1.3B on generalization to unseen rephrasings by writing a variable number of seen rephrasings of the same fact to memory. After writing N_reph rephrasings for each of N_fact facts to memory, we estimate recall by querying the model with N_reph unseen rephrasings. (As in the sequential editing experiment with rephrase queries, we use a reference memory matrix constructed from the prompt encodings for the facts written to memory.) In Table 13, we show the average recall of the ground-truth answer for samples from the ZsRE validation set, revealing generalization to unseen rephrases. Naturally, for facts with more rephrasings in memory, recall is higher. We furthermore compare the Gaussian convolution method of Eq. (7) to computing reading and writing weights with the reference memory matrix pseudoinverse, w = z(M^{(ref)})^{\dagger}. As in Figure 5, Eq. (7) leads to better recall with fewer rephrasings per fact, but falls short when there are many rephrasings per fact.

(N_fact, N_reph)   Pseudoinverse   Gaussian
(20, 10)           0.94            0.90
(40, 5)            0.84            0.84
(100, 2)           0.66            0.78
(200, 1)           0.33            0.69
(1, 1)             0.63            0.68

Table 13. Recall after writing N_reph rephrasings for each of N_fact ZsRE facts to Larimar-1.3B memory and querying with unseen phrasings, using (i) w = z(M^{(ref)})^{\dagger} ("pseudoinverse") or (ii) Eq. (7) ("Gaussian").

G. Generation Robustness

We assess Larimar's robustness to sampling noise in the reading weights (\sigma_w) in terms of edit success and perplexity. To measure edit success, we use 2000 cases from the CounterFact dataset. For each case, the encoding of the prompt is concatenated with the 'new target' to form an episode, which is written to memory. Next, we sample the weight vector w \sim N(\bar{w}, \sigma_w), w \in R^K, and take z = wM to be the read-out vector, which is decoded along with the prompt. We then report the edit success. To measure perplexity, we consider 1000 samples from the Wikipedia dataset. For each sentence, we write it into Larimar's memory and take the first 10 characters of the sentence as our prompt. We then perform generation as above. We repeat these steps for each of the 1000 sentences, and the generated text is fed into the GPT-2 Large model to compute perplexity.

Figure 6. Generation perplexity and single fact edit success as a function of varying magnitude of \sigma_w for Larimar-6B. (Results show that our Z_readout is robust to noise in the addressing/memory matrix and also leads to the correct response from the decoders.)

In Figure 6, we report the perplexity and rewrite success metrics as a function of \sigma_w, averaged over 3 independent runs. Overall, the results indicate that Larimar is fairly robust to increased noise variance up to a range.
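The noisy-read probe described above can be summarized with the following sketch, assuming isotropic Gaussian noise on the reading weights and a K x d memory matrix; the helper name and the placeholder values are hypothetical, and the encoding/decoding steps are only indicated in comments.

```python
# Illustrative sketch of the robustness probe in this section: perturb the mean
# reading weights with Gaussian noise of scale sigma_w and form the read-out
# vector z = w M, which is then decoded together with the prompt (decoder not shown).
import numpy as np

def noisy_readout(w_bar, M, sigma_w, rng=None):
    """Sample w ~ N(w_bar, sigma_w^2 I) and return the read-out vector z = w M."""
    rng = rng or np.random.default_rng(0)
    w = w_bar + sigma_w * rng.standard_normal(w_bar.shape)  # noisy reading weights
    return w @ M                                             # z = wM, passed to the decoder

# Example with random stand-ins for the memory and the mean address:
K, d = 512, 768
M = np.random.randn(K, d)          # memory matrix (placeholder values)
w_bar = np.ones(K) / K             # mean reading weights (placeholder)
z = noisy_readout(w_bar, M, sigma_w=0.1)
# Edit success and perplexity are then measured on the text decoded from z
# (together with the prompt), averaged over the CounterFact and Wikipedia samples.
```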
