This repository contains the code base for reproducing the experiments from our paper *Memorization in Attention-only Transformers*, accepted at AISTATS 2025 and available here [Todo]!
The notebook `experiments.ipynb` produces the data stored in the folder `Scaling laws`. Each `.py` file in that folder analyzes this data and creates the graphs shown in the paper; the resulting images are stored in the folder `Images`.
This is joint work with Yann Chevaleyre and Muni Sreenivas Pydi, carried out at LAMSADE, Paris Dauphine University. [Todo]
Our main theoretical result, expressed in Corollary 1 of our paper, is a lower bound on the number of examples that an Attention-only Transformer (AoT) can memorize. This in turn gives a scaling-law lower bound on the accuracy of a (well-)trained AoT.
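To make the setting concrete, here is a minimal sketch of what an attention-only Transformer looks like in code: a token embedding, a stack of multi-head self-attention layers with residual connections and no MLP sublayers, and a linear readout. This is illustrative only; all names and sizes below are placeholder assumptions, not the exact model from `experiments.ipynb`.

```python
# Minimal sketch of an attention-only Transformer (AoT): attention layers
# with residual connections and no MLP sublayers. Illustrative only; the
# exact model used in the paper lives in experiments.ipynb.
import torch
import torch.nn as nn


class AttentionOnlyTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, n_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.readout = nn.Linear(d_model, n_classes)

    def forward(self, tokens):
        x = self.embed(tokens)            # (batch, seq, d_model)
        for attn in self.attn_layers:
            out, _ = attn(x, x, x)        # self-attention, no feed-forward block
            x = x + out                   # residual connection
        return self.readout(x[:, -1])     # classify from the last position


# Example forward pass: batch of 4 sequences of 8 tokens, 10 classes.
model = AttentionOnlyTransformer(vocab_size=100, d_model=32,
                                 n_heads=2, n_layers=1, n_classes=10)
logits = model(torch.randint(0, 100, (4, 8)))  # shape (4, 10)
```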
In experiments 1 to 4, we test the accuracy scaling of an AoT in its parameters.
Plots of the scaling in each parameter are included in the paper and stored in the folder `Images`.
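As a rough illustration of what one such sweep could look like, the hypothetical sketch below (reusing the `AttentionOnlyTransformer` class from the previous snippet) trains the model to memorize random labels for several head counts and records the training accuracy. The actual protocol and hyperparameters are those of `experiments.ipynb`, not the placeholder values used here.

```python
# Hypothetical sketch of one scaling sweep: train an AoT to memorize N random
# (sequence -> label) pairs for several head counts and record the training
# accuracy. All sizes, learning rates, and step counts are placeholder values.
import torch

N, seq_len, vocab, n_classes = 512, 8, 100, 10
X = torch.randint(0, vocab, (N, seq_len))
y = torch.randint(0, n_classes, (N,))  # random labels: a pure memorization task

accuracies = {}
for n_heads in [1, 2, 4, 8]:
    model = AttentionOnlyTransformer(vocab_size=vocab, d_model=32,
                                     n_heads=n_heads, n_layers=1,
                                     n_classes=n_classes)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(2000):  # full-batch training until (attempted) memorization
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (model(X).argmax(dim=-1) == y).float().mean().item()
    accuracies[n_heads] = acc  # accuracy on the memorized training set
print(accuracies)
```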
In experiment 5, we measured the accuracy of a Transformer with only one attention head (to mix tokens) and an MLP of varying size. This allowed us to compare the two architectures' accuracy for a given number of parameters. We found that both scalings are the same, meaning that MLPs and attention layers can memorize equally well in practice (although this might depend on the optimization procedure).
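To make the "given number of parameters" comparison concrete, here is a small hypothetical helper that counts trainable parameters for the two architectures, so the MLP width can be chosen to roughly match the attention-only budget. The exact block shapes used in experiment 5 may differ; everything below is an assumption for illustration.

```python
# Hypothetical helper for a parameter-matched comparison: count trainable
# parameters of (a) a stack of attention-only layers and (b) a single
# attention layer followed by an MLP, then tune the MLP width until the two
# budgets roughly agree. All sizes here are placeholder assumptions.
import torch.nn as nn


def n_params(module: nn.Module) -> int:
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


d_model = 32

# (a) Attention-only: several attention layers, no MLP.
attention_only = nn.ModuleList(
    nn.MultiheadAttention(d_model, num_heads=1) for _ in range(4)
)

# (b) One attention head to mix tokens, plus an MLP of varying width.
hidden = 194  # vary this width to match the attention-only parameter budget
attn_plus_mlp = nn.ModuleDict({
    "attn": nn.MultiheadAttention(d_model, num_heads=1),
    "mlp": nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                         nn.Linear(hidden, d_model)),
})

print(n_params(attention_only), n_params(attn_plus_mlp))  # ~16.9k vs ~16.9k
```

Comparing accuracies at matched parameter budgets like this is what makes a conclusion such as "MLPs and attention layers memorize equally well" meaningful.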
Appendices to the paper: we include variations of experiments 1, 2, and 5 featuring larger embedding dimensions as well as more depth. We also include experiment 6, which compares the lower bound with the measured scalings in the embedding dimension.