A Solvable Attention For Neural Scaling Laws
ABSTRACT
Transformers and many other deep learning models are empirically shown to predictably improve their performance as a power law in training time, model size, or the number of training data points, a phenomenon termed the neural scaling law.
This paper studies this intriguing phenomenon particularly for the transformer ar-
chitecture in theoretical setups. Specifically, we propose a framework for linear
self-attention, the underpinning block of transformer without softmax, to learn in
an in-context manner, where the corresponding learning dynamics is modeled as
a non-linear ordinary differential equation (ODE) system. Furthermore, we estab-
lish a procedure to derive a tractable approximate solution for this ODE system by
reformulating it as a Riccati equation, which allows us to precisely characterize
neural scaling laws for linear self-attention with training time, model size, data
size, and the optimal compute. In addition, we reveal that the linear self-attention
shares similar neural scaling laws with several other architectures when the con-
text sequence length of the in-context learning is fixed, otherwise it would exhibit
a different scaling law of training time.
1 INTRODUCTION
Large language models (LLMs) (e.g., GPT (Brown et al., 2020) and Llama (Meta, 2024)) have
made significant achievements across a variety of tasks, ranging from question answering to decision
making. Adopting the transformer architecture (Vaswani et al., 2017), these LLMs are large in the
sense of both parameters and training data, e.g., the largest Llama 3 model has 405B parameters
and is trained on 15.6T tokens (Meta, 2024). One of the most striking phenomena of such LLMs is their continuing performance gain as the model size and the number of training steps are scaled up. More remarkably, their performance behaves predictably as a power law in the number of parameters, the amount of computation, or the data size (Kaplan et al., 2020; Hoffmann et al., 2022). This impressive power-law behavior is termed neural scaling laws.
In particular, for a model with D trainable parameters, neural scaling laws state that the test loss L(D, t) should obey L(D, t) = E + A D^{-β} + B t^{-γ} (Kaplan et al., 2020; Hoffmann et al., 2022), where t is the number of optimization steps and E captures the loss for a generative process on the data distribution. Holding across a wide range of orders of magnitude, these neural scaling laws have led to the fundamental belief that autoregressive transformer language models can continually improve their performance when scaled up. Interestingly, they also allow practitioners to determine the trade-off between model size and training time for a fixed compute budget (Hoffmann et al., 2022) or to design datasets with clever pruning (Sorscher et al., 2022).
Given the significant role of neural scaling laws, the theoretical understanding of their origin and mechanism, such as the values of their exponents, has become increasingly important. Hutter (2021) designed a linear model that can exhibit power laws and showed that not all data distributions lead to power laws; Maloney et al. (2022) applied random matrix theory to identify necessary properties of scaling laws and proposed a statistical model that captures the neural scaling laws; Bordelon et al. (2024); Nam et al. (2024) proposed different solvable models to reveal the existence of scaling laws. Although these initial attempts made simplifications on model architectures and data for the purpose of analytical tractability, they largely advanced our understanding of neural scaling laws from the theoretical perspective.
On the other hand, one important aspect commonly absent in these works is that they do not consider the transformer architecture, the universal architecture of current LLMs, leaving the theoretical understanding of neural scaling laws for modern LLMs underexplored. Transformers are special not only because they employ self-attention as the primary component, but also because of the way they perform prediction: an incredible mechanism called in-context learning (Brown et al., 2020; Garg et al., 2023) that adapts their predictions based on data given in context.
The uniqueness of the transformer gives rise to many intriguing questions from a theoretical perspective. What are the origins of neural scaling laws of transformers? Do transformers induce different neural scaling laws compared to other models? Will in-context learning (e.g., the context sequence length) affect neural scaling laws? Given the importance of the transformer and its neural scaling laws, investigating these questions is of great interest and necessity.
Answering these questions from a theoretical perspective requires a thorough understanding of ex-
plicit forms of model predictions during training, which, however, is hard since it typically requires
solving non-linear ODEs that usually do not admit closed-form solutions. Towards this direction,
Saxe et al. (2014) modelled the learning dynamics of deep linear networks as the logistic differential
equation that can be solved exactly, Pinson et al. (2023) solved the dynamics of linear convolution
neural networks, and Bordelon et al. (2024) applied a DMFT approach from statistical physics to
solve random feature models. For transformers, recently Zhang et al. (2023); Tarzanagh et al. (2024)
established the forms of converged parameters in regression and classification settings. However,
explicit forms of parameters along the training trajectory are still unclear, leading to a gap when
investigating neural scaling laws for transformers.
In this paper, we attempt to provide initial answers for the aforementioned questions to fill the gap
in part and take a step towards understanding neural scaling laws of LLMs. To conduct an amenable
analysis, we focus on the self-attention, which stands at the core of the transformer architecture,
in the linear case. We note that linear self-attention has been widely adopted in recent works (von
Oswald et al., 2023; Li et al., 2023b; Zhang et al., 2023) to study properties of transformers. Although feature learning is absent, linear self-attention has the advantage of allowing a clear theoretical characterization. We discuss more related works on learning dynamics, neural scaling laws, and the
analysis of in-context learning for (linear) self-attention in Appendix B.
Our Contributions.
1. We design a multitask sparse feature regression (MSFR) problem for the linear self-
attention block to learn in an in-context manner. More importantly, we derive a tractable
solution for linear self-attention by modelling its in-context learning dynamics in the
MSFR problem as a non-linear ODE system and reformulating the system to a set of Ric-
cati equations. This is highly nontrivial since non-linear ODE systems are hard to solve,
thus our procedure might be of broad interest.
This solution captures dynamical behaviors of linear self-attention during training explic-
itly. To the best of our knowledge, this is the first closed-form solution of self-attention
along the training trajectory. We highlight that it can be applied as an interesting proxy for
investigating properties of self-attention and transformers due to its analytical tractability.
2. Building upon this solution, we characterize neural scaling laws of linear self-attention by varying training time, model size, or the number of training data points when the data obeys a power law, which then gives us the scaling law in the optimal compute budget. In addition, we are able to characterize the role of the context sequence length in neural scaling laws, revealing that if it obeys a different power law then the time scaling law will be affected; otherwise linear self-attention shares similar neural scaling laws with other models, which aligns well with the empirical observations in Kaplan et al. (2020).
2 SETUP OF FRAMEWORK
Notations. We use {1, . . . , N} to denote all integers between 1 and N. For two vectors a, b ∈ R^d, we use a_j to denote the j-th component of a, a ⊙ b to denote the elementwise product, a · b to denote the inner product, and diag(a) to denote the d × d matrix with its diagonal elements equal to a. We use ȧ to denote the derivative of a with respect to time. We let δ_{s,s'} be 1 if s = s' and 0 otherwise.
For a matrix A, we use A_{i,j} to denote the entry in its i-th row and j-th column. We use a ∼ P to denote that a is sampled from distribution P. We use 0_d ∈ R^d to denote the zero vector in R^d.
In Section 2.1, we define the problem setting of the MSFR problem, and in Section 2.2 we present the concept of in-context learning and the generation of in-context learning data for it. Finally, in Section 2.3, we describe the details of the linear self-attention block.
There are N_s different tasks in total. We let S be the random variable of picking a specific task among the N_s tasks and assume that S follows a power-law distribution:
$$P_\alpha(S = s) = Z s^{-\alpha} \tag{1}$$
where Z is the normalization constant, i.e., $Z^{-1} = \sum_{s=1}^{N_s} s^{-\alpha}$, and α > 1. Since we focus on linear self-
attention, we assume the existence of a non-linear sparse feature extractor to perform the feature
learning. Specifically, for an input data vector x ∈ Rd and a task type s ∈ {1, . . . , Ns }, there exists
a unique feature extractor
$$\phi(s, x) : \mathbb{R} \times \mathbb{R}^d \mapsto \{-1, 0, 1\}^{N_s} \subset \mathbb{R}^{N_s}, \tag{2}$$
where only the s-th component of ϕ(s, x) can be nonzero, i.e., ϕ_{s'}(s, x) = ±δ_{s',s}. Furthermore,
given task type s, we let the strength for task s be Λs ∈ R. The target y ∈ R is now defined through
$$y(s, x) = \sum_{k=1}^{N_s} \Lambda_s\, \phi_k(s, x). \tag{3}$$
We elaborate on two properties of this problem before moving on. (i) The reason why this problem is
termed as “multitask” is because we have Ns different tasks such that each has its own task strength
Λs and feature extractor ϕ(s, x), meaning that the model should learn distinct Λs for each task.
(ii) If we let Λ ∈ RNs be the collection of all task strengths, then the target can be written as
y(s, x) = Λ · ϕ(s, x), which is like a linear regression over the feature ϕ(s, x). Since ϕ(s, x) is
like a one-hot vector, the problem is a “regression with sparse feature”. The subtlety lies in that we
must rely on all task types to learn the complete Λ when compared to standard linear regression.
Therefore, our problem is defined as “multitask sparse feature regression”.
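To make the setup concrete, the following minimal Python sketch (our own illustration; the random-sign rule inside the feature extractor is a hypothetical choice, since the problem only requires ϕ_{s'}(s, x) = ±δ_{s',s}) instantiates the task distribution Eq. (1), the sparse feature extractor Eq. (2), and the target Eq. (3).

import numpy as np

rng = np.random.default_rng(0)

def sample_task(Ns, alpha):
    # Sample a task type s in {1, ..., Ns} from the power law of Eq. (1).
    p = np.arange(1, Ns + 1, dtype=float) ** (-alpha)
    p /= p.sum()                       # p[s-1] = Z * s^{-alpha}
    return int(rng.choice(np.arange(1, Ns + 1), p=p))

def phi(s, x, Ns):
    # Sparse feature extractor of Eq. (2): only the s-th entry is +1 or -1.
    # The sign rule (sign of the first coordinate of x) is an illustrative choice.
    out = np.zeros(Ns)
    out[s - 1] = 1.0 if x[0] >= 0 else -1.0
    return out

def target(s, x, Lam, Ns):
    # Target of Eq. (3): y(s, x) = Lambda_s * phi_s(s, x) for the sparse feature.
    return Lam[s - 1] * phi(s, x, Ns)[s - 1]

Ns, d, alpha = 8, 4, 1.8
Lam = np.ones(Ns)                      # task strengths Lambda_s (here G(s) = 1)
s = sample_task(Ns, alpha)
x = rng.standard_normal(d)
print(s, target(s, x, Lam, Ns))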
A remarkable ability of LLMs is that they can perform in-context learning to adapt to a specific
task given a context in the form of instructions (Brown et al., 2020). More specifically, the goal
of in-context learning is to enable a learner (e.g., a transformer) to use the context data to make a
prediction for the query data. To incorporate this ability, we focus on in-context learning in this
paper, and present its details formally for the MSFR problem (Section 2.1) in this section.
Generation of in-context data. Given task type s, we let the corresponding context sequence
length ψs = F(s) ∈ R and task strength Λs = G(s) ∈ R, i.e., each task s has a constant context
sequence length ψs and a constant task strength Λs determined by the maps F and G, respectively.
The generation is composed of four parts (see Fig. 1): (i) a task type s ∈ {1, . . . , Ns } is first sampled
from the distribution Pα (S = s) (Eq. (1)), which gives us the corresponding context sequence
length ψs and task strength Λs ; (ii) we sample ψs different input vectors x ∈ Rd and a query
vector x̂ ∈ R^d from the input data distribution P_X, then these data vectors are organized to form a matrix $X = \begin{pmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(\psi_s)} & \hat{x} \end{pmatrix} \in \mathbb{R}^{d \times (\psi_s + 1)}$; (iii) we apply the feature extractor ϕ to each column x^{(i)} of X to obtain the sparse feature ϕ(s, x^{(i)}) ∈ R^{N_s} and generate the targets y^{(i)} := y(s, x^{(i)}) and ŷ := y(s, x̂) according to Eq. (3), then one in-context data point of the task s can now be generated as
$$\Phi(s, X) := \begin{pmatrix} \phi^{(1)} & \cdots & \phi^{(\psi_s)} & \hat{\phi} \\ y^{(1)} & \cdots & y^{(\psi_s)} & 0 \end{pmatrix} = \begin{pmatrix} \phi(s, x^{(1)}) & \cdots & \phi(s, x^{(\psi_s)}) & \phi(s, \hat{x}) \\ y(s, x^{(1)}) & \cdots & y(s, x^{(\psi_s)}) & 0 \end{pmatrix}; \tag{4}$$
(iv) repeating the above procedure for N times can give us an in-context dataset with N data points,
where the numbers of data points for different tasks obey the power law Eq. (1). Finally, given in-
context data Φ(s, X) ∈ R(Ns +1)×(ψs +1) (Eq. (4)) and loss function L, in-context learning aims to
[Figure 1: Generation of one in-context data point. Step (i): sample a task type s ∼ P_α(S = s), which fixes the sequence length ψ_s and task strength Λ_s; Step (ii): sample the data matrix X (context vectors and query x̂) from P_X; Step (iii): apply the feature extractor to obtain the in-context data Φ(s, X) ∈ R^{(N_s+1)×(ψ_s+1)}; Step (iv): repeat N times.]
learn a model f : R^{(N_s+1)×(ψ_s+1)} → R such that θ* = arg min_θ L(f(Φ; θ), ŷ). Note that MSFR can be seen as a limiting case of in-context regression under the source/capacity condition (a generalization of the setup in Lu et al. (2024) with that of Cui et al. (2022), see Appendix H.1).
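The four generation steps can be written out as a short Python sketch (illustrative only; it reuses the hypothetical sample_task and phi helpers from the previous sketch, and ψ_s, Λ_s are passed in as arrays psi and Lam).

import numpy as np

rng = np.random.default_rng(1)

def incontext_point(s, psi, Lam, Ns, d):
    # One in-context data point Phi(s, X) of Eq. (4) plus the query target y_hat.
    psi_s, Lam_s = int(psi[s - 1]), Lam[s - 1]
    X = rng.standard_normal((d, psi_s + 1))            # step (ii): psi_s contexts + 1 query
    feats = np.stack([phi(s, X[:, i], Ns) for i in range(psi_s + 1)], axis=1)
    ys = Lam_s * feats[s - 1, :]                        # targets from Eq. (3)
    Phi = np.zeros((Ns + 1, psi_s + 1))
    Phi[:Ns, :] = feats                                 # top Ns rows: sparse features
    Phi[Ns, :psi_s] = ys[:psi_s]                        # last row: context targets, 0 for the query
    return Phi, ys[psi_s]

def incontext_dataset(N, Ns, alpha, psi, Lam, d):
    # Step (iv): repeat N times; task counts follow the power law of Eq. (1).
    data = []
    for _ in range(N):
        s = sample_task(Ns, alpha)
        data.append((s,) + incontext_point(s, psi, Lam, Ns, d))
    return data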
The self-attention block stands at the core of the transformer architecture (Vaswani et al., 2017). A single-head self-attention block (without residual connection) f : R^{d×d_L} → R^{d×d_L} parameterized by θ updates an input G ∈ R^{d×d_L} to
$$\hat{G} := f(G; \theta) = P V G\, \mathrm{softmax}\big((W_K G)^{\top} (W_Q G)\big) \in \mathbb{R}^{d \times d_L}.$$
With the linear (softmax-free) parameterization, we can write the output of the linear self-attention block for the task s as f(Φ(s, X); θ) = [V Φ Φ^T W ϕ̂]_s, which will be used in the rest of this paper.
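A direct way to evaluate this output is given by the short sketch below (our own; the shape conventions, V ∈ R^{N_s×(N_s+1)} with rows v_s, W ∈ R^{(N_s+1)×N_s} with columns w_s, and ϕ̂ ∈ R^{N_s} as the query feature, are assumptions inferred from the decompositions used in Section 3).

import numpy as np

def linear_attention_output(V, W, Phi, phi_hat, s):
    # f(Phi(s, X); theta) = [V Phi Phi^T W phi_hat]_s for linear self-attention.
    # Assumed shapes: V (Ns, Ns+1), W (Ns+1, Ns), Phi (Ns+1, psi_s+1), phi_hat (Ns,).
    out = V @ (Phi @ (Phi.T @ (W @ phi_hat)))          # vector in R^{Ns}
    return float(out[s - 1])                           # the s-th component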
Section 2 establishes our in-context learning framework of the multitask sparse feature regression problem for linear self-attention. In this section, we closely investigate the corresponding learning dynamics by modelling it as non-linear ODE systems in Section 3.1 and give a tractable solution to it in Section 3.2.
Given the in-context dataset generated according to the procedure described in Section 2.2 with N
data points $\{\Phi(s^{(n)}, X^{(n)})\}_{n=1}^{N}$, we use the mean-squared error (MSE) loss such that
$$\tilde{L}(\theta) = \frac{1}{2N} \sum_{n=1}^{N} \Big( f\big(\Phi(s^{(n)}, X^{(n)}); \theta\big) - \hat{y}^{(n)} \Big)^2 \tag{6}$$
where L̃ is the empirical loss. The goal of in-context learning now becomes the traditional empirical
loss minimization θ ∗ = arg minθ L̃(θ), which can be solved by various optimization algorithms,
and we focus on the general gradient descent in the continuous time limit, i.e., gradient flow (GF):
V̇ = −∇V L̃(V , W ), Ẇ = −∇W L̃(V , W ).
To further investigate the learning dynamics, considering the formulation of the feature extractor Eq. (2) and the in-context data Eq. (4), and denoting the standard basis vector in R^{N_s} as $e_s = (0 \cdots 0\ 1\ 0 \cdots)^{\top} \in \mathbb{R}^{N_s}$ for s ∈ {1, . . . , N_s} such that the only nonzero component of e_s is its s-th component, we find that H_s ∈ R^{(N_s+1)×(N_s+1)} defined by (Appendix D.1)
$$H_s := \Phi\big(s^{(n)}, X^{(n)}\big) \Phi^{\top}\big(s^{(n)}, X^{(n)}\big) = \begin{pmatrix} \mathrm{diag}\big((\psi_s + 1) e_s\big) & \psi_s \Lambda_s e_s \\ \psi_s \Lambda_s e_s^{\top} & \psi_s \Lambda_s^2 \end{pmatrix} \tag{7}$$
does not change for different n, where ψs is the context sequence length and Λs is the task strength
for the task s, both of which only depend on the task type s. H_s is composed of the feature covariance and target (Eq. (24)). In addition, if we further decompose V as $V^{\top} = (v_1 \cdots v_{N_s})$ with v_i ∈ R^{N_s+1} for all i ∈ {1, . . . , N_s}, and recall the decomposition of W in Eq. (5), then we can rewrite the original empirical loss Eq. (6) as (Appendix D.1)
$$\text{Empirical loss function:}\quad \tilde{L} = \frac{1}{2} \sum_{s=1}^{N_s} \frac{\#s}{N} \big( v_s^{\top} H_s w_s - \Lambda_s \big)^2 \tag{8}$$
where #s denotes the number of in-context data points of the task type s in the dataset $\{\Phi(s^{(n)}, X^{(n)})\}_{n=1}^{N}$, i.e., $\#s = \sum_{n=1}^{N} \delta_{s, s^{(n)}}$. Eq. (8) indicates that the dynamics of v_s and w_s
for different s are decoupled: the s-th row of V and the s-th column of W are responsible for learning and predicting the task strength of the task type s, allowing self-attention to adapt itself to different tasks according to the in-context data. With this empirical loss function, we can now use a set of non-linear ODE systems, ∀s ∈ {1, . . . , N_s}:
$$\text{In-context learning dynamics:}\quad \dot{v}_s = -\frac{\#s}{N} (f_s - \Lambda_s) H_s w_s, \qquad \dot{w}_s = -\frac{\#s}{N} (f_s - \Lambda_s) H_s v_s \tag{9}$$
to describe the in-context learning dynamics by GF, where we denote f_s = v_s^T H_s w_s for simplicity. We note that f_s is sufficient for us to investigate the dynamical behaviors of the output of self-attention for task type s and the empirical loss. Thus, by a slight abuse of notation, we refer to the solution of f_s as the solution of the in-context learning dynamics, which can also be applied to give solutions of v_s and w_s.
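As a numerical counterpart (a sketch; the initialization scale, step size, and number of steps are our own choices), the decoupled dynamics Eq. (9) can be integrated per task with explicit Euler steps, using H_s built directly from Eq. (7).

import numpy as np

def H_matrix(s, psi_s, Lam_s, Ns):
    # H_s of Eq. (7) for task s (1-indexed).
    e = np.zeros(Ns); e[s - 1] = 1.0
    H = np.zeros((Ns + 1, Ns + 1))
    H[:Ns, :Ns] = np.diag((psi_s + 1) * e)
    H[:Ns, Ns] = H[Ns, :Ns] = psi_s * Lam_s * e
    H[Ns, Ns] = psi_s * Lam_s ** 2
    return H

def gradient_flow(psi, Lam, counts, N, T=5000, dt=1e-4, init=1e-2, seed=0):
    # Euler discretization of Eq. (9); counts[s-1] = #s, the number of data
    # points of task s. Returns the empirical loss Eq. (8) along the trajectory.
    Ns = len(Lam)
    rng = np.random.default_rng(seed)
    V = init * rng.standard_normal((Ns, Ns + 1))       # rows v_s
    W = init * rng.standard_normal((Ns, Ns + 1))       # rows w_s (i.e., columns of W)
    Hs = [H_matrix(s, psi[s - 1], Lam[s - 1], Ns) for s in range(1, Ns + 1)]
    losses = []
    for _ in range(T):
        loss = 0.0
        for s in range(Ns):
            f = V[s] @ Hs[s] @ W[s]
            err, c = f - Lam[s], counts[s] / N
            loss += 0.5 * c * err ** 2
            V[s], W[s] = V[s] - dt * c * err * (Hs[s] @ W[s]), W[s] - dt * c * err * (Hs[s] @ V[s])
        losses.append(loss)
    return np.array(losses)

With ψ_s and Λ_s held constant across tasks, the loss curve returned by gradient_flow can later be compared against the closed-form prediction derived in Section 3.2.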
We highlight that the ODE systems above are non-linear in both v_s ∈ R^{N_s+1} and w_s ∈ R^{N_s+1} and, obviously, are different from the logistic differential equations obtained from the GF dynamics of deep linear networks (Saxe et al., 2014; Nam et al., 2024) as well as from the Lotka-Volterra predator-prey model (Volterra, 1928). In this sense, the dynamics of linear self-attention (and the transformer) differs from that of deep linear networks. Meanwhile, we note that non-linear ODE systems, including the vector-valued system Eq. (9), typically do not admit closed-form solutions. Therefore, solving Eq. (9) to obtain the explicit dynamical behaviors of linear self-attention is novel as well as intriguing, and might be of independent interest.
Although it is intractable to give the exact closed-form solution to the in-context learning dynam-
ics Eq. (9), in this section, we will provide a solution that can be approximately exact under the
following condition. We defer technical details of this section to Appendix E.
Assumption 3.1. ∀s ∈ {1, . . . , Ns }, the context sequence length ψs ≫ 1.
Procedure sketch. Before diving into a detailed procedure for deriving the solution, we first present a rough sketch of it. The first step is to transform the ODE systems Eq. (9) into a more symmetrical form, Eq. (10), by a change of variables. Then we decompose Eq. (10) into two sets of ODE systems by comparing both sides of Eq. (10) at the zero-th and first orders of ϵ_s := 1/ψ_s, since H_s can be decomposed into two parts, H_s^0 and ϵ_s H_s^1, with ϵ_s ≪ 1. We then apply a change of variables again and derive a new set of ODEs, Eq. (11), as Riccati equations, Eq. (12), which admit closed-form solutions once we notice the existence of an important conserved quantity of the dynamics.
We now discuss the procedure in detail. Our first crucial observation is that the ODE of v_s in Eq. (9) is non-linear with respect to w_s, which makes it hard to solve. Therefore, we first convert Eq. (9) into a more symmetrical form by a change of variables: let η_s = v_s + w_s and ρ_s = v_s − w_s; then the dynamics of η_s ∈ R^{N_s+1} and ρ_s ∈ R^{N_s+1} can be obtained from Eq. (9):
$$\dot{\eta}_s = -\frac{\#s}{N} \left( \frac{g_s - h_s}{4} - \Lambda_s \right) H_s \eta_s, \qquad \dot{\rho}_s = \frac{\#s}{N} \left( \frac{g_s - h_s}{4} - \Lambda_s \right) H_s \rho_s, \tag{10}$$
where we define g_s = η_s^T H_s η_s and h_s = ρ_s^T H_s ρ_s.
In this way, by solving Eq. (10), we can find the solution of the self-attention and the empirical loss
function Eq. (8). However, Eq. (10) is still not directly solvable. Fortunately, recalling the definition of H_s in Eq. (7), we can rewrite H_s as a sum of two parts, H_s = ψ_s (H_s^0 + ϵ_s H_s^1), where
$$H_s^0 = \begin{pmatrix} \mathrm{diag}(e_s) & \Lambda_s e_s \\ \Lambda_s e_s^{\top} & \Lambda_s^2 \end{pmatrix}, \qquad H_s^1 = \begin{pmatrix} \mathrm{diag}(e_s) & 0 \\ 0 & 0 \end{pmatrix},$$
and ϵ_s = 1/ψ_s ≪ 1 according to Assumption 3.1, which allows us to treat ϵ_s H_s^1 as an insignificant perturbation in the dynamics Eq. (10) and solve it using perturbation analysis.
Specifically, suppose that the solutions of Eq. (10) can be written as ηs = ηs0 + ϵs ηs1 and ρs = ρ0s +
ϵs ρ1s such that ηs1 and ρ1s are treated as perturbations to ηs0 and ρ0s respectively, then gs and hs can
also be written in a perturbed form gs = gs0 + ϵs gs1 and hs = h0s + ϵs h1s accordingly (Appendix E.1).
Now we can obtain ODEs for ηs0 , ηs1 , ρ0s , and ρ1s by comparing terms to the zero-th and first orders
of ϵs in both sides of Eq. (10), respectively (Appendix E.1). This will finally give us ODEs for
gs0 , h0s , gs1 , and h1s , the final ODEs that we aim to solve since the output of self-attention for the task
s can be written as fs := fs0 (t) + ϵs fs1 (t) = [(gs0 − h0s ) + ϵs (gs1 − h1s )]/4.
Our strategy for finding solutions of fs is now composed of two parts using the perturbation analysis:
(i) solve the ODEs for gs0 and h0s exactly and (ii) find ηs0 and ρ0s according to the solved gs0 and h0s ,
then put them into ODEs of gs1 and h1s to find their solutions. As mentioned earlier, fs1 (t) is far less
significant than fs0 (t) to the dynamical behaviors of self-attention given Assumption 3.1, thus we
defer the discussion of fs1 (t) to Appendix J and only focus on fs0 (t).
We now discuss the first step for f_s^0(t). Our key observation is that H_s^0 is like an idempotent matrix: (H_s^0)^2 = (Λ_s^2 + 1) H_s^0, which gives us the dynamics of g_s^0 and h_s^0 as a new set of non-linear ODEs:
$$\dot{g}_s^0 = -\frac{a_s}{2}\left(g_s^0 - h_s^0 - 4\Lambda_s\right) g_s^0, \qquad \dot{h}_s^0 = \frac{a_s}{2}\left(g_s^0 - h_s^0 - 4\Lambda_s\right) h_s^0, \tag{11}$$
where we let a_s = #s ψ_s (Λ_s^2 + 1)/N for ease of notation. Though Eq. (11) is still a non-linear ODE system, it is much more tractable than the original in-context learning dynamics Eq. (9). Our following key observation simplifies Eq. (11) drastically even further: ∀t ≥ 0: g_s^0 h_s^0 = 2C_s, where C_s is a constant determined by the initialization, i.e., g_s^0 h_s^0 is conserved under the dynamics, since d(g_s^0 h_s^0)/dt = 0. In this way, Eq. (11) becomes the following set of Riccati equations that can be solved (Appendix E.2) to give our main results:
$$\dot{g}_s^0 = 2 a_s \Lambda_s g_s^0 - \frac{a_s}{2} (g_s^0)^2 + a_s C_s, \qquad \dot{h}_s^0 = -2 a_s \Lambda_s h_s^0 - \frac{a_s}{2} (h_s^0)^2 + a_s C_s. \tag{12}$$
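The conservation claim can be verified in one line directly from Eq. (11):
$$\frac{d}{dt}\big(g_s^0 h_s^0\big) = \dot{g}_s^0 h_s^0 + g_s^0 \dot{h}_s^0 = -\frac{a_s}{2}\big(g_s^0 - h_s^0 - 4\Lambda_s\big) g_s^0 h_s^0 + \frac{a_s}{2}\big(g_s^0 - h_s^0 - 4\Lambda_s\big) g_s^0 h_s^0 = 0,$$
so g_s^0(t) h_s^0(t) = g_s^0(0) h_s^0(0) =: 2C_s for all t, and substituting h_s^0 = 2C_s/g_s^0 (and symmetrically g_s^0 = 2C_s/h_s^0) into Eq. (11) yields the Riccati form of Eq. (12).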
Theorem 3.1 (Solution for in-context learning dynamics of linear self-attention: zero-th order). For
the MSFR problem and the in-context learning dynamics by GF of the linear self-attention block Eq. (9), the solution f_s(t) can be approximately written as an expansion f_s^0(t) + ϵ_s f_s^1(t) at large ψ_s, with
$$f_s^0(t) = \Lambda_s + \frac{\lambda_s}{2} \left( \frac{1}{1 + P_s \exp(a_s \lambda_s t)} - \frac{1}{1 + Q_s \exp(a_s \lambda_s t)} \right) \tag{13}$$
where a_s = #s ψ_s (Λ_s^2 + 1)/N, and
$$\lambda_s = \sqrt{4\Lambda_s^2 + 2C_s}, \qquad P_s = \frac{4 f_s^0(0) \Lambda_s + 2C_s + \lambda_s \sqrt{4 \big(f_s^0(0)\big)^2 + 2C_s}}{2 \big(f_s^0(0) - \Lambda_s\big)\big(\lambda_s - 2\Lambda_s\big)}, \qquad Q_s = P_s\, \frac{2\Lambda_s - \lambda_s}{2\Lambda_s + \lambda_s}$$
are determined by the initialization. In addition, when v_s^0(0) = ±w_s^0(0), the constant C_s = 0 and, denoting Δ_s = (Λ_s − f_s^0(0))/f_s^0(0), the solution simplifies to a standard logistic function
$$f_s^0(t) = \frac{\Lambda_s}{1 + \Delta_s e^{-2 a_s \Lambda_s t}} \quad \text{when } v_s^0(0) = \pm w_s^0(0). \tag{14}$$
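As a quick numerical check of Eq. (14) (a self-contained sketch with our own illustrative constants and a symmetric initialization v_s(0) = w_s(0), so that C_s = 0), one can integrate the full dynamics Eq. (9) and compare with the logistic closed form; the deviation should be small when ψ_s ≫ 1.

import numpy as np

# Illustrative values (our own choices); symmetric initialization v_s(0) = w_s(0).
Ns, s, psi_s, Lam_s, frac = 4, 1, 100.0, 1.0, 0.25     # frac = #s / N

# H_s of Eq. (7): nonzero only on the rows/columns indexed by s-1 and Ns.
e = np.zeros(Ns); e[s - 1] = 1.0
H = np.zeros((Ns + 1, Ns + 1))
H[:Ns, :Ns] = np.diag((psi_s + 1) * e)
H[:Ns, Ns] = H[Ns, :Ns] = psi_s * Lam_s * e
H[Ns, Ns] = psi_s * Lam_s ** 2

rng = np.random.default_rng(0)
v = w = 1e-2 * rng.standard_normal(Ns + 1)
f0 = v @ H @ w                                         # f_s(0) >= 0 since H is PSD

# Closed-form logistic solution Eq. (14), with a_s = (#s/N) psi_s (Lam_s^2 + 1).
a_s = frac * psi_s * (Lam_s ** 2 + 1)
Delta = (Lam_s - f0) / f0
dt, T = 1e-4, 2000
t = np.arange(T) * dt
f_theory = Lam_s / (1 + Delta * np.exp(-2 * a_s * Lam_s * t))

# Euler integration of the full dynamics Eq. (9) from the same initialization.
f_num = []
for _ in range(T):
    f = v @ H @ w
    f_num.append(f)
    g = frac * (f - Lam_s)
    v, w = v - dt * g * (H @ w), w - dt * g * (H @ v)

print(np.max(np.abs(np.array(f_num) - f_theory)))      # O(1/psi_s) deviation expected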
[Figure 2: Loss L(t) for different context sequence lengths ψ (5 and 100) during training. Solid lines are theoretical predictions while dashed lines are empirical simulations.]
a_s, P_s, Q_s, and λ_s jointly control the learning process, where P_s, Q_s, and λ_s are determined by the initialization and task strength (e.g., when f_s^0(0) = Λ_s both P_s, Q_s → ∞ and thus f_s^0(t) = Λ_s for t ≥ 0), while a_s is determined by the dataset, e.g., the sequence length and task strength. In Fig. 2, we compare the loss calculated using f_s^0(t) with that obtained from direct empirical simulation for different context sequence lengths. It can be seen that when the context sequence length ψ ≫ 1 (blue lines), the theoretical prediction of the test loss matches the empirical results precisely, which validates the accuracy of f_s^0(t). In addition, we emphasize that Theorem 3.1 is applicable for any task-type distribution P_α (Eq. (1)), sequence-length map F(s), and task-strength map G(s) for the MSFR problem in Section 2, i.e., the derivation of f_s^0(t) does not require assumptions on the forms of P_α, F(s), and G(s).
f_s^0(t) explicitly characterizes the dynamical behaviors of the self-attention block, including the influence of various parameters, suggesting that it could contribute to the understanding of self-attention in a variety of aspects. In this paper, our focus is neural scaling laws. As another example, f_s^0(t) shows that self-attention learns different tasks at different rates that depend on various parameters such as the sequence length, task strength, number of data points, and the initialization. The difference in learning speeds across tasks might lead to the grokking phenomenon (Power et al., 2022), since the model can quickly fit a fraction of the tasks while learning the rest extremely slowly.
Concretely, we consider neural scaling laws of linear self-attention with respect to each one of the model size D, the training time t, and the number of training data points N when the other two factors are not the bottleneck of training. Finally, we consider scaling laws for the optimal compute C. We list the discussion of these factors as follows and defer details to Appendix F.
• Size of the model D. To quantify the model size D, inspired by Michaud et al. (2023); Bordelon & Pehlevan (2022); Nam et al. (2024), which considered different models empirically or theoretically, we assume that there is a cutoff D such that the model cannot learn any task strength Λ_s with s ≥ D. Specifically, we let W_{KQ} ∈ R^{D×D}, V ∈ R^{(D−1)×D} and apply a new feature extractor such that ϕ'(s, x) ∈ R^{D−1} only extracts the first D − 1
elements of the original ϕ(s, x). When D is the bottleneck of training, we let t, N → ∞
(they are sufficient for the training) to derive scaling laws with respect to D.
• Optimal compute C. This is the case when the number of data points is sufficient for the training, while the training time t or the model size D is the bottleneck given the compute budget C = Dt, so that either t or D scales differently with C. Specifically, if L(t, D) = a_t t^{−α_t} + a_D D^{−α_D}, then we can derive the optimal test loss as L ∝ C^{−α_t α_D/(α_t + α_D)} given C = tD (Appendix F.1); a brief derivation sketch follows this list.
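For completeness, a brief derivation sketch of the compute-optimal exponent quoted above (a standard constrained minimization; the constants a_t, a_D are dropped from the final scalings):
$$\min_{tD = C}\; a_t t^{-\alpha_t} + a_D D^{-\alpha_D}: \quad \frac{\partial}{\partial t}\Big[a_t t^{-\alpha_t} + a_D (C/t)^{-\alpha_D}\Big] = 0 \;\Longrightarrow\; \alpha_t a_t t^{-\alpha_t} = \alpha_D a_D D^{-\alpha_D},$$
so the two terms balance at the optimum, giving t ∝ C^{α_D/(α_t+α_D)}, D ∝ C^{α_t/(α_t+α_D)}, and hence L ∝ C^{−α_t α_D/(α_t+α_D)}.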
To inspect the effects of the context sequence length ψ_s ∈ R and task strength Λ_s ∈ R on neural scaling laws, we let ψ_s = F(s) ∝ s^{−β}, which is inspired by the underlying power-law correlations in language sequences (Ebeling & Pöschel, 1994), and Λ_s = G(s) ∝ s^{−γ} with β, γ ≥ 0. We investigate two different cases. In the first case (Section 4.1), we assume that γ = β = 0 such that the context sequence length ψ_s and task strength Λ_s do not depend on the task type s. In this setting, we can compare neural scaling laws of linear self-attention with other architectures. In the second case (Section 4.2), we study positive β and γ, which are unique to self-attention. Details of Sections 4.1 and 4.2 are deferred to Appendices F.2 and F.3, respectively.
[Figure 3: Neural scaling laws for linear self-attention with different values of α = 1.8, 2.1, showing the test loss versus training time t, model size D, number of data points N, and optimal compute. In each figure, solid lines represent empirical simulation results and dashed lines power-law curves. Theoretical predictions of the test loss are plotted in (b) with dotted lines as a comparison. In (d), α = 1.8 and different levels of transparency reflect different model sizes D within the range [N_s/100, N_s/5].]
For simplicity we assume that the model is initialized as Cs = C and fs0 (0) = f0 , which implies
that λs = λ, Ps = P , and Qs = Q for s ∈ {1, . . . , Ns } do not depend on the task type s. When
the context sequence length ψs and task strength Λs are the same for different task types s, i.e.,
Architecture does not matter for scaling laws when the context sequence length is fixed. The results summarized in Table 1 reveal that, when the data admits a similar power-law structure, linear self-attention shares the same neural scaling laws with ReLU MLPs (Michaud et al., 2023) and diagonal linear networks (Nam et al., 2024) with respect to t, N, and D. Linear self-attention also exhibits a similar time scaling law to the linear models considered in Bordelon et al. (2024) and Hutter (2021). These similarities indicate that the model architecture does not significantly affect the exponents of neural scaling laws, which aligns well with the empirical conclusion reached by Kaplan et al. (2020), who showed that transformers share similar exponents of neural scaling laws with other models when the power-law structures hold.
To further capture how the context sequence length ψs and the task strength Λs affect neural scaling
laws for the in-context learning of self-attention, we let ψs = F(s) ∝ s−β , Λs = G(s) ∝ s−γ as in
Section 2.2. We can write the test loss as
$$L(t) \approx \frac{Z}{2} \sum_{s=1}^{N_s} s^{-\alpha - 2\gamma} \left( \frac{\Delta \exp(-2 a_s \Lambda_s t)}{1 + \Delta \exp(-2 a_s \Lambda_s t)} \right)^2. \tag{17}$$
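Equation (17) can also be evaluated numerically to read off the time exponent via a log-log slope fit (a small sketch; the values of α, β, γ, Δ and the proportionality constants in F and G are our own illustrative choices, and #s/N is replaced by its expectation P_α(S = s)).

import numpy as np

alpha, beta, gamma = 1.8, 1.0, 0.5
Ns, Delta = 10**5, 1.0
s = np.arange(1, Ns + 1, dtype=float)
Z = 1.0 / np.sum(s ** (-alpha))                        # normalization of Eq. (1)
psi, Lam = 100.0 * s ** (-beta), s ** (-gamma)         # psi_s = F(s), Lambda_s = G(s)
a = Z * s ** (-alpha) * psi * (Lam ** 2 + 1)           # a_s with #s/N ~ P_alpha(S = s)

def loss(t):
    # Test loss of Eq. (17) at training time t.
    r = Delta * np.exp(-2 * a * Lam * t)
    return 0.5 * Z * np.sum(s ** (-alpha - 2 * gamma) * (r / (1 + r)) ** 2)

ts = np.logspace(2, 5, 30)
Ls = np.array([loss(t) for t in ts])
slope = np.polyfit(np.log(ts), np.log(Ls), 1)[0]       # approximately minus the time exponent
print(slope)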
Following a similar procedure as in Section 4.1, we derive neural scaling laws of linear self-attention for the MSFR problem in Table 2, which gives us the following insights.
[Table 2: Neural scaling laws for linear self-attention when both ψ_s and Λ_s depend on s (columns: Scaling Law, Condition).]
Varied context sequence length affects the scaling law of time. Table 2 reveals that a varied context sequence length makes the learning process slower (Fig. 4a). According to Table 2, a positive β leads to a larger (closer to zero) exponent of the time law, thus the test loss decreases more slowly than in the case β = 0. This suggests that it is better to balance the context sequence length across different tasks to obtain a satisfactory test loss under a limited number of optimization steps. This conclusion is specific to self-attention compared to the other architectures considered in previous theoretical works, since they have no counterpart of the context sequence length. On the other hand, we also find that β does not appear in the scaling laws for the model size and the number of data points, indicating that self-attention can still admit similar scaling laws for D and N as other architectures when γ = 0.
[Figure 4: Neural scaling laws for self-attention with varied sequence length ψ_s ∝ s^{−β} and strength Λ_s ∝ s^{−γ}. Panels: (a) time law and (b) optimal compute law with γ = 0; (c) time law, (d) model size law, (e) data size law, and (f) optimal compute law with β = 1, γ = 1.5. In each figure, α = 1.8; solid lines represent empirical simulation results and dashed lines power-law curves. The power-law curves with γ = β = 0 from Table 1 are plotted (black dashed lines) as a comparison. In (a) and (b), γ = 0 to examine the effects of ψ_s on the time and optimal compute laws only. In (b) and (f), different levels of transparency reflect varying model sizes D.]
Varied task strength affects all scaling laws. Table 2 reveals that a varied task strength reduces the required model size (Fig. 4d) or number of data points (Fig. 4e) in our MSFR problem. Specifically, due to a positive γ, the exponents of the scaling laws for both the model size and the number of data points become smaller, thus the learning requires fewer data points or a smaller model to achieve a similar test loss when they are the bottleneck.
5 CONCLUSION
In this paper, we aim to understand the learning dynamics of self-attention, which stands at the core of the transformer architecture, from a theoretical perspective. For this purpose, we first design a multitask sparse feature regression problem for the self-attention to learn in an in-context manner, whose learning dynamics is then modelled as non-linear ODE systems. We then give a tractable solution to the ODE systems, which should be of broad interest since non-linear ODE systems typically do not admit closed-form solutions. We also highlight that this solution can be employed as an interesting proxy for studying a variety of properties of self-attention and transformers. Finally, we use the proposed solution to investigate neural scaling laws of self-attention with respect to each of training time, number of data points, and model size when the other two are not the bottleneck of the learning process, which in turn allows us to establish the neural scaling law with respect to the optimal compute budget.
REFERENCES
Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent
and a multi-scale theory of generalization, 2020. URL https://arxiv.org/abs/2008.
06786.
Subutai Ahmad and Gerald Tesauro. Scaling and generalization in neural networks: A case study.
In D. Touretzky (ed.), Advances in Neural Information Processing Systems, volume 1. Morgan-
Kaufmann, 1988. URL https://proceedings.neurips.cc/paper_files/paper/
1988/file/d1f491a404d6854880943e5c3cd9ca25-Paper.pdf.
Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement
preconditioned gradient descent for in-context learning, 2023a. URL https://arxiv.org/
abs/2306.00297.
Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement
preconditioned gradient descent for in-context learning, 2023b. URL https://arxiv.org/
abs/2306.00297.
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning
algorithm is in-context learning? investigations with linear models, 2023. URL https://
arxiv.org/abs/2211.15661.
Alexander Atanasov, Blake Bordelon, and Cengiz Pehlevan. Neural networks as kernel learners:
The silent alignment effect, 2021. URL https://arxiv.org/abs/2111.00034.
Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, and Cengiz Pehlevan. The onset of
variance-limited behavior for networks in the lazy and rich regimes, 2022. URL https:
//arxiv.org/abs/2212.12147.
Alexander Atanasov, Jacob A. Zavatone-Veth, and Cengiz Pehlevan. Scaling and renormalization in
high-dimensional regression, 2024. URL https://arxiv.org/abs/2405.00592.
Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural
scaling laws. Proceedings of the National Academy of Sciences, 121(27), June 2024. ISSN
1091-6490. doi: 10.1073/pnas.2311878121. URL http://dx.doi.org/10.1073/pnas.
2311878121.
Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham M. Kakade, eran malach, and Cyril Zhang.
Hidden progress in deep learning: SGD learns parities near the computational limit. In Al-
ice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neu-
ral Information Processing Systems, 2022. URL https://openreview.net/forum?id=
8XWP2ewX-im.
Carl M. Bender and Steven A. Orszag. Advanced Mathematical Methods for Scientists and Engi-
neers. McGraw Hill, 1978.
Blake Bordelon and Cengiz Pehlevan. Learning curves for SGD on structured features. In Interna-
tional Conference on Learning Representations, 2022. URL https://openreview.net/
forum?id=WPI2vbkAl3Q.
Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in
kernel regression and wide neural networks, 2021. URL https://arxiv.org/abs/2002.
02561.
Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling
laws. In Forty-first International Conference on Machine Learning, 2024. URL https://
openreview.net/forum?id=nbOY1OmtRc.
Lukas Braun, Clémentine Dominé, James Fitzgerald, and Andrew Saxe. Exact learn-
ing dynamics of deep linear networks with prior knowledge. In S. Koyejo, S. Mo-
hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural
Information Processing Systems, volume 35, pp. 6615–6629. Curran Associates, Inc.,
2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/
file/2b3bb2c95195130977a51b3bb251c40a-Paper-Conference.pdf.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar-
wal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In
H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neu-
ral Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc.,
2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/
file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foun-
dations of Computational Mathematics, 7(3):331–368, 2007. doi: 10.1007/s10208-006-0196-8.
URL https://doi.org/10.1007/s10208-006-0196-8.
Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in
kernel regression: the crossover from the noiseless to noisy regime*. Journal of Statistical Me-
chanics: Theory and Experiment, 2022(11):114004, November 2022. ISSN 1742-5468. doi: 10.
1088/1742-5468/ac9829. URL http://dx.doi.org/10.1088/1742-5468/ac9829.
Stéphane d’Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double
descent : Bias and variance(s) in the lazy regime, 2020. URL https://arxiv.org/abs/
2003.01054.
Karthik Duraisamy. Finite sample analysis and bounds of generalization error of gradient descent in
in-context linear regression, 2024. URL https://arxiv.org/abs/2405.02462.
W Ebeling and T Pöschel. Entropy and long-range correlations in literary english. Europhysics
Letters (EPL), 26(4):241–246, May 1994. ISSN 1286-4854. doi: 10.1209/0295-5075/26/4/001.
URL http://dx.doi.org/10.1209/0295-5075/26/4/001.
Benjamin L. Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and vari-
able creation in self-attention mechanisms, 2022. URL https://arxiv.org/abs/2110.
10090.
Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn
in-context? a case study of simple function classes, 2023. URL https://arxiv.org/abs/
2208.01066.
Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli,
Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with
number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experi-
ment, 2020(2):023401, February 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/ab633c. URL
http://dx.doi.org/10.1088/1742-5468/ab633c.
Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of dis-
crete gradient dynamics in linear neural networks. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neu-
ral Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL
https://proceedings.neurips.cc/paper_files/paper/2019/file/
f39ae9ff3a81f499230c4126e01f421b-Paper.pdf.
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad,
Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable,
empirically, 2017. URL https://arxiv.org/abs/1712.00409.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hen-
nigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy,
Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre.
Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/
2203.15556.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and gen-
eralization in neural networks, 2020. URL https://arxiv.org/abs/1806.07572.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models, 2020. URL https://arxiv.org/abs/2001.08361.
Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning
and weight shifting for softmax regression, 2023a. URL https://arxiv.org/abs/2304.
13276.
Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as
algorithms: Generalization and stability in in-context learning, 2023b. URL https://arxiv.
org/abs/2301.07067.
Yue M. Lu, Mary I. Letey, Jacob A. Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. Asymp-
totic theory of in-context learning by linear attention, 2024. URL https://arxiv.org/
abs/2405.11751.
Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws,
2022. URL https://arxiv.org/abs/2210.16859.
Eric J Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural
scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL
https://openreview.net/forum?id=3tbTw2ga8K.
Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, and Ard A. Louis. An exactly
solvable model for emergence and scaling laws, 2024. URL https://arxiv.org/abs/
2404.17563.
Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 phases of compute-
optimal neural scaling laws, 2024. URL https://arxiv.org/abs/2405.15074.
Hannah Pinson, Joeri Lenaerts, and Vincent Ginis. Linear cnns discover the statistical structure of
the dataset using only the most dominant frequencies, 2023. URL https://arxiv.org/
abs/2303.02034.
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gener-
alization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/
abs/2201.02177.
Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction
of the generalization error across scales, 2019. URL https://arxiv.org/abs/1909.
12673.
Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, and Nir Shavit. On the predictability of
pruning across scales, 2021. URL https://arxiv.org/abs/2006.10621.
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dy-
namics of learning in deep linear neural networks, 2014. URL https://arxiv.org/abs/
1312.6120.
Utkarsh Sharma and Jared Kaplan. Scaling laws from the data manifold dimension. J. Mach. Learn.
Res., 23(1), January 2022. ISSN 1532-4435.
James B. Simon, Madeline Dickens, Dhruva Karkada, and Michael R. DeWeese. The eigenlearning
framework: A conservation law perspective on kernel regression and wide neural networks, 2023.
URL https://arxiv.org/abs/2110.03922.
Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Be-
yond neural scaling laws: beating power law scaling via data pruning. In S. Koyejo,
S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neu-
ral Information Processing Systems, volume 35, pp. 19523–19536. Curran Associates, Inc.,
2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/
file/7b75da9b61eda40fa35453ee5d077df6-Paper-Conference.pdf.
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The im-
plicit bias of gradient descent on separable data, 2024. URL https://arxiv.org/abs/
1710.10345.
Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods:
empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and
Experiment, 2020(12):124001, December 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/
abc61d. URL http://dx.doi.org/10.1088/1742-5468/abc61d.
Ingo Steinwart, Don Hush, and Clint Scovel. Optimal rates for regularized least squares regression.
In Proceedings of the 22nd Annual Conference on Learning Theory, 01 2009.
Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak. Transformers
as support vector machines, 2024. URL https://arxiv.org/abs/2308.16898.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Vito Volterra. Variations and Fluctuations of the Number of Individuals in Animal Species living
together. ICES Journal of Marine Science, 3(1):3–51, 04 1928. ISSN 1054-3139. doi: 10.1093/
icesjms/3.1.3. URL https://doi.org/10.1093/icesjms/3.1.3.
Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordv-
intsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient
descent, 2023. URL https://arxiv.org/abs/2212.07677.
Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict
how real-world neural representations generalize, 2022. URL https://arxiv.org/abs/
2203.06176.
Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L. Bartlett.
How many pretraining tasks are needed for in-context learning of linear regression?, 2024. URL
https://arxiv.org/abs/2310.08391.
Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-
context, 2023. URL https://arxiv.org/abs/2306.09927.
14
Published as a conference paper at ICLR 2025
A PPENDIX
The structure of the Appendix is as follows.
What will the prediction of the self-attention for the query token ϕ̂(s, x̂) be given a different input
sequence Φ̃(s, X)? Below we use our example to illustrate that the perfectly trained linear self-
attention will adapt itself for Φ̃(s, X). In particular, let the task strength for the input tokens in
Φ̃(s, X) prior to the query token be Λ̃s ̸= Λs (the task strength is unseen in the training data). Then
the prediction of the self-attention for this input sequence Φ̃(s, X) becomes
fs = vsT (∞)Φ̃Φ̃T ws (∞)
(18)
= vs;Ns +1 (∞)ws;s (∞)ψs Λ̃s = Λ̃s ,
which indicates that linear self-attention successfully predicts the unseen task strength Λ̃s by per-
forming in-context learning to explore in-context data to adapt itself to a new task.
In addition, we can show that the above newly imposed constraint does not affect the learning dy-
namics, hence our method for solving the dynamics in Section 3.2 can still be applied here even
without Assumption 3.1, i.e., the solution obtained here will be an exact one. To see this point more
clearly, we rewrite the output of the model with the constraints as
fs := vs Hs ws = vs;Ns +1 ψs Λs ws;s ,
where vs = vs;Ns +1 and ws = ws;s are 1-dimensional vectors while Hs = ψs Λs is a 1 × 1 matrix
that satisfies the idempotent-like condition. Thus the dynamics will be exactly the same as Eq. (9),
and Theorem 3.1 can still be applied.
15
Published as a conference paper at ICLR 2025
Finally, we briefly explain the reason why we impose these constraints. For the task s and the input
sequence Φ(s, X) with task strength Λs , the output of the model is
vsT ΦΦT ws = ψs (vs;s ws;s + Λs (vs;Ns +1 ws;s + ws;Ns +1 vs;s ) + Λ2s vs;Ns +1 ws;Ns +1 ).
We note that the first term does not depend on Λs explicitly, meaning that it does not explore the
relationship between ϕ(s, x) and y, thus it does not contribute to the in-context learning task. In ad-
dition, the last term depends on Λs in a non-linear way, leading to an inappropriate dependence on
the relationship between x and y. Due to such terms, without any conditions, the linear self-attention
is hard to perfectly adapt itself for new input sequences Φ̃(s, X) with different task strengths. This
problem is caused by its structure. A possible way to achieve a perfect in-context adaption is impos-
ing the constraints vs;s = ws;Ns +1 = 0 (the constraints used in our example and Zhang et al. (2023);
Wu et al. (2024); Lu et al. (2024)) to eliminate the aforementioned terms. Therefore, we expect that
an optimal in-context learner should set the first Ns components as 0, since they contribute to many
terms that are not necessary for the model to achieve perfect in-context adaption.
Besides the aforementioned works on learning dynamics of neural networks or random feature mod-
els, Pinson et al. (2023) studied the learning dynamics of gradient descent for linear convolution
neural networks. Particularly, they discovered an interesting interplay between the data structure
and network structure that determines the phases of the network along the training trajectory. The
learning dynamics is also analyzed by Gidel et al. (2019) while Braun et al. (2022); Atanasov et al.
(2021) focused on different initialization regimes. Our focus in this paper is the learning dynamics
of linear self-attention with the in-context learning.
Besides Kaplan et al. (2020); Hoffmann et al. (2022), there are a number of recent works that ex-
plored scaling laws in deep neural networks empirically (Rosenfeld et al., 2021; Hestness et al.,
2017; Rosenfeld et al., 2019). The study of neural scaling laws can be found in some earlier
works (Caponnetto & De Vito, 2007; Steinwart et al., 2009; Ahmad & Tesauro, 1988). From the
theoretical perspective, various works developed solvable models in the context of random feature
models (Bahri et al., 2024; Atanasov et al., 2022; 2024; Bordelon et al., 2024; Paquette et al., 2024)
to study the neural scaling laws in certain limits. In addition, Wei et al. (2022); Bordelon et al.
(2021); Sharma & Kaplan (2022); Bordelon & Pehlevan (2022) also conducted theoretical analysis
16
Published as a conference paper at ICLR 2025
on linear models to investigate the neural scaling laws. These works improve the theoretical under-
standing for the neural scaling law to a large extent. As a comparison, our focus in this paper is
particularly on the linear self-attention with the in-context learning, which is not widely discussed
in previous works.
The success of the transformer architecture (Vaswani et al., 2017) has encouraged a majority body
of works to investigate its theoretical understanding, especially the intriguing in-context learning
mechanism. A common setup along this direction is the study of linear regression using the linear
self-attention (Duraisamy, 2024; Ahn et al., 2023b; Wu et al., 2024; Lu et al., 2024; Zhang et al.,
2023). Our work also falls into this category as mentioned in Section 2.3. We present a more
extensive comparison and connection to Zhang et al. (2023) and Lu et al. (2024) in the following.
Comparison and connection to Zhang et al. (2023). Zhang et al. (2023) considered linear regres-
sion using linear self-attention in the in-context learning manner. By assuming the infinite training
dataset size limit, Zhang et al. (2023) revealed that the converged linear self-attention can achieve
a competitive performance compared to the best linear predictor over the test data distribution. As
a comparison, our setting is also a regression for the linear self-attention to learn in an in-context
manner, while our problem is a multitask version and the distribution of the task type obeys a power
law. In addition, besides the converged solution of the dynamics, we derive its (approximate) form
along the whole training trajectory, which in turn makes it possible to characterize the neural scaling
laws with respect to time, data size, model size, and the optimal compute. And we note that the
characterization for the time scaling law (and the optimal compute law) cannot be derived solely by
the converged solution.
Comparison and connection to Lu et al. (2024). Lu et al. (2024) proposed a solvable model
of in-context learning for a linear regression task by linear self-attention. Specifically, assuming a
limit where the input dimension, the context sequence length, the training task diversity, and the
data size are all taken to infinity following certain ratios, they revealed a double-descent learning
curve with respect to the number of examples. As a comparison, our problem is also a multitask
regression for the linear self-attention to learn in an in-context way. In addition, as we will explain
in Appendix H, our problem can be seen as a limiting case of the multitask in-context regression
under the source-capacity condition, which is also a generalized version of the setup considered in
Lu et al. (2024). Furthermore, our solution (to the first order of ϵs when the context length is large)
captures the whole training trajectory and we study the neural scaling laws with respect to various
parameters so that we do not assume that all parameters are taken to infinity together.
17
Published as a conference paper at ICLR 2025
• gs and hs
gs = ηs · Hs ηs , hs = ρs · Hs ρs .
• In-context learning dynamics Eq. (9)
#s #s
v̇s = − (fs − Λs ) Hs ws , ẇs = − (fs − Λs ) Hs vs (19)
N N
• Test loss Eq. (15)
N
1X s
2
L(t) = Ex∼PX ,s∼Pα [ℓ (f (Φ(s, X); θ), y)] ≈ Pα (S = s) fs0 (t) − Λs . (20)
2 s=1
To navigate the paper, we also present a table for notation and definitions with the corresponding
index in Table 3.
18
Published as a conference paper at ICLR 2025
Deriving Hs . To derive Hs (Eq. (7)), we decompose the in-context data point Φ(s, X) (Eq. (4))
for the task s as
P ϕ̂s
Φ(s, X) = Ts (21)
ys 0
where
(1)
h i ys
Ns ×ψs ..
Ps = ϕ(1)
s · · · ϕs
(ψs )
∈ R , y s = ψ
. ∈ R s, (22)
(ψs )
ys
and ϕ̂s = ϕ(s, x̂). Then
Ps ϕ̂s PsT ys
Φ(s, X)(Φ(s, X))T =
ysT 0 ϕ̂Ts 0
T T
Ps Ps + ϕ̂s ϕ̂s Ps ys
= . (23)
ysT PsT ysT ys
There are four terms for us to compute to get the form of Hs in Eq. (23): (i) the first one is
(1) T
h i (ϕs ) ψs
(ψs ) .. X (j) (j) T
Ps PsT = ϕ(1)s · · · ϕs . = ϕs (ϕs ) = diag(es )ψs (24)
(ψ ) j=1
(ϕs s )T
T
where the standard basis vector in RNs is es = (0 · · · 0 1 0 · · ·) ∈ RNs for s ∈
{1, . . . , Ns } such that the only nonzero component of es is its s-th component; (ii) the second
one is ϕ̂s ϕ̂Ts = diag(es ), which can be easily verified; (iii) the third one is
(1)
h i ys Xψs
(ψs ) ..
Ps ys = ϕ(1)
s · · · ϕ s . = ϕ(j) (j)
s ys = ψs Λs es ;
(ψ ) j=1
ys s
Pψ (j)
(iv) the final one is ysT ys = j s (ys )2 = ψs Λ2s . Combining these terms gives us the form of Hs :
diag ((ψs + 1)es ) ψs Λs es
Hs = .
ψs Λs eTs ψs Λ2s
As a result, the output of self-attention for the in-context data point Φ(s(n) , X(n) ) will be
f Φ(s(n) , X(n) ) = vsT Hs(n) W ϕ̂s(n) ,
which gives us the empirical loss as
XN h i2
a 1
L̃ = f Φ(s(n) , X(n) ); θ − ŷ (n)
2N n=1
N
" Ns
#2
b 1
X X
T (n) (n)
= v (n) Hs(n) W ϕ̂s(n) − Λs(n) ϕk (s , x̂ )
2N n=1 s
k=1
XN XNs
c 1 T 2
= vs(n) Hs(n) ws(n) − Λs(n) (ϕk (s(n) , x̂(n) ))2
2N n=1
k=1
XN
d 1 T 2
= vs(n) Hs(n) ws(n) − Λs(n)
2N n=1
N
e 1X s
#s T 2
= vs Hs ws − Λs , (25)
2 s=1 N
where we use the definition of empirical loss in a, the definition of target y in b, the decomposition
of W Eq. (5) in c, ϕk (s, x̂) = ±δs,k according to definition of ϕ Eq. (2) in d, and recall that #s
denotes the number of in-context data points with s(n) = s for n ∈ {1, . . . , N } in e.
19
Published as a conference paper at ICLR 2025
We adopt the continuous time limit of gradient descent, i.e., gradient flow, to perform the empirical
loss minimization
V̇ = −∇V L̃(V , W ), Ẇ = −∇W L̃(V , W ). (26)
Specifically, for empirical loss function Eq. (8), we can directly obtain the learning dynamics as
non-linear ODE systems: ∀s ∈ {1, . . . , Ns }
#s
v̇s = − (fs − Λs ) Hs ws ,
N
#s
ẇs = − (fs − Λs ) Hs vs
N
where recall that we denote fs = vsT Hs ws .
In E.1, we derive the ODEs of gs and hs , while we solve the ODEs to the zero-th and first order of
ϵs under Assumption 3.1 in E.2 and J, respectively.
There are four steps to derive ODEs for gs and hs , which will be discussed as follows.
Step I: change of variable. As discussed earlier, the ODE of vs Eq. (9) is non-linear with respect
to ws , thus we transform it to a more symmetrical form by the change of variable: let
ηs = vs + ws , ρs = vs − ws ,
then the dynamics of ηs ∈ RNs +1 and ρs ∈ RNs +1 can be obtained according to Eq. (9)
#s gs − hs #s gs − hs
η̇s = − − Λs Hs ηs , ρ̇s = − Λs Hs ρs , (27)
N 4 N 4
as a result, the ODE of ηs is non-linear with respect to ηs while that of ρs is non-linear to ρs .
Step II: deriving ODEs to different orders of ϵs . According to the definition of Hs in Eq. (7),
we can rewrite Hs as a sum of two matrices
Hs = ψs Hs0 + ϵs Hs1
where
diag(es ) Λs es diag(es ) 0
Hs0 = 1
, Hs =
Λs eTs Λ2s 0 0
and ϵs = 1/ψs ≪ 1 given Assumption 3.1. Therefore we can treat ϵHs1 as an insignificant pertur-
bation in the dynamics Eq. (27). Let the solutions of Eq. (27) be
ηs = ηs0 + ϵs ηs1 , ρs = ρ0s + ϵs ρ1s
such that ηs1 and ρ1s are treated as perturbations to ηs0 and ρ0s , respectively. Then, according to the
definitions of gs and hs , we can also write gs and hs in the perturbed form
gs := gs0 + ϵs gs1 = ηs · Hs ηs
= ψs ηs0 + ϵs ηs1 · Hs0 + ϵs Hs1 ηs0 + ϵs ηs1
= ψs ηs0 · Hs0 ηs0 + ψs ϵs ηs0 · Hs1 ηs0 + 2ηs1 · Hs0 ηs0 + O(ϵ2s ) (28)
hs := h0s + ϵs h1s = ρs · Hs ρs
= ψs ρ0s · Hs0 ρ0s + ψs ϵs ρ0s · Hs1 ρ0s + 2ρ1s · Hs0 ρ0s + O(ϵ2s ). (29)
20
Published as a conference paper at ICLR 2025
Putting the above perturbation forms back to Eq. (27) to the first order of ϵs , we have
ψs #s 0
η̇s0 + ϵs η̇s1 = − gs − h0s + ϵs gs1 − h1s − 4Λs Hs0 ηs0 + ϵs Hs1 ηs0 + Hs0 ηs1 + O(ϵ2s ),
4N
0 1 ψ s #s 0
ρ̇s + ϵs ρ̇s = gs − h0s + ϵs gs1 − h1s − 4Λs Hs0 ρ0s + ϵs Hs1 ρ0s + Hs0 ρ1s + O(ϵ2s ).
4N
(30)
Matching both sides of Eq. (30) to the zero-th and first order of ϵs respectively gives us
ψs #s 0 ψs # s 0
η̇s0 = − gs − h0s − 4Λs Hs0 ηs0 , ρ̇0s = gs − h0s − 4Λs Hs0 ρ0s (31)
4N 4N
and
ψ s #s 0
η̇s1 = − gs − h0s − 4Λs Hs1 ηs0 + Hs0 ηs1 + (gs1 − h1s )Hs0 ηs0 ,
4N (32)
1 ψs # s 0
ρ̇s = gs − h0s − 4Λs Hs1 ρ0s + Hs0 ρ1s + (gs1 − h1s )Hs0 ρ0s .
4N
Step III: deriving ODEs for gs and hs to zero-th and first orders of ϵs . We can now obtain the
ODEs for gs and hs to the zero-th order of ϵs by directly applying the definitions and Eq. (31):
d ψs #s ηs0 · Hs0 Hs0 ηs0
ġs0 = ψs ηs0 · Hs0 ηs0 = −ψs gs0 − h0s − 4Λs
dt 2N
a 0 0
as ψs ηs0 · Hs0 ηs0
= − gs − hs − 4Λs
Zero-th Order: 2 (33)
0 0
a s gs
0
= − gs − hs − 4Λs ,
2
as h0s
ḣ0s = gs0 − h0s − 4Λs ,
2
where we use (Hs0 )2 = (Λ2s + 1)Hs0 in a and derive the equation for h0s in a similar way. Suppose
that we obtain the solution to the zero-th order by solving the above ODEs, then, according to the
definition of gs1 in Eq. (28) and the definition of h1s in Eq. (29), we only need to find the solutions of
ms = ψs ηs1 · Hs0 ηs0 and ns = ψs ρ1s · Hs0 ρ0s (34)
to derive the solution to the first order of ϵs , since we can obtain ηs0 · Hs1 ηs0 and ρ0s · Hs1 ρ0s using
the solutions of Eq, (33)(see Appendix J.1). This means that based on Eq. (32), we need to solve the
following ODEs:
ṁs = ψs ηs1 · Hs0 η̇s0 + ψs η̇s1 · Hs0 ηs0
ψ 2 #s 0
=− s gs − h0s − 4Λs ηs1 · Hs0 Hs0 ηs0 + ηs0 · Hs0 Hs1 ηs0 + Hs0 Hs0 ηs1
4N
First Order: ψ 2 #s
− s (gs1 − h1s )ηs0 Hs0 Hs0 ηs0
4N
as ms ψ 2 #s as gs0
= − gs − h0s − 4Λs
0
+ s ηs0 · Hs0 Hs1 ηs0 − (gs1 − h1s )
2 4N 4
(35)
and, similarly,
ṅs = ρ1s · Hs0 ρ̇0s + ρ̇1s · Hs0 ρ0s
ψ 2 #s 0
= s gs − h0s − 4Λs ρ1s · Hs0 Hs0 ρ0s + ρ0s · Hs0 Hs1 ρ0s + Hs0 Hs0 ρ1s
4N
First Order: ψ 2 #s
+ s (gs1 − h1s )ρ0s Hs0 Hs0 ρ0s
4N
as ns ψ 2 #s as h0s
= gs − h0s − 4Λs
0
+ s ρ0s · Hs0 Hs1 ρ0s + (gs1 − h1s ) .
2 4N 4
(36)
Then we can write the solution of gs1 and h1s as
gs1 = ψs ηs0 Hs1 ηs0 + 2ms , h1s = ψs ρ0s Hs1 ρ0s + 2ns .
We will solve the zero-th order ODEs in the next section and discuss the solution of the first order
ODEs in J.
21
Published as a conference paper at ICLR 2025
where we use ∥a∥2H = aT Ha for a positive definite matrix H and vector a and we note that Cs = 0
if vs0 (0) = ±ws0 (0). In the following, we adopt a series of change of variable to solve Eq. (39). For
simplicity, we omit the s subscript and recover it in the final solution.
Step I. Let
ag ah
p=− , q=− , (41)
2 2
then we can transform Eq. (39) to
aġ a2 C
ṗ = − = 2aΛp + p2 −
2 2
(42)
aḣ a2 C
q̇ = − = −2aΛq + q 2 − .
2 2
Step III. Eq. (44) are just the second-order linear ODEs, which can be solved following a standard
approach. Specifically, let
γ = ect , θ = ebt ,
then putting them back into Eq. (44) gives us
√
2 a2 C 4Λ2 a2 + 2a2 C
c − 2Λac − = 0 =⇒ c = Λa ±
2 √ 2 (45)
2 2 a2 + 2a2 C
a C 4Λ
b2 + 2Λab − = 0 =⇒ b = −Λa ± .
2 2
For ease of notation, we denote
√
4S 2 + 2a2 C
S = Λa, ξ = , σ+ = S + ξ, σ− = S − ξ (46)
2
22
Published as a conference paper at ICLR 2025
Step IV. It is now left for us to determine A, B, E and F according to the initial condition, which
can be obtained by noting the conserved quantity gh in Eq. (38) and can be satisfied when
σ+ Aeσ+ t + Bσ− eσ− t σ− Ee−σ− t + σ+ F e−σ+ t
= 2C (50)
Aeσ+ t + Beσ− t Ee−σ− t + F e−σ+ t
2 2 a2 C
=⇒ σ+ AF + σ− BE + (AF + BE) = 0. (51)
2
Noticing that
2
σ+ = S 2 + 2Sξ + ξ 2 , σ− 2
= S 2 − 2Sξ + ξ 2 ,
we can further simplify Eq. (51) to
AF σ+ = BEσ− . (52)
On the other hand, considering the initial condition by letting t = 0 in Eq. (49)
λ 1 1
f (0) = Λ + − , (53)
2 E/F + 1 A/B + 1
if we denote
E σ− 2Λ − λ 2(f (0) − Λ)
P = , Q̄ = = , D= , (54)
F σ+ 2Λ + λ λ
then one can easily see that A/B = Q̄E/F and Eq. (53) becomes
D(1 + P )(1 + P Q̄) + P − P Q̄ = 0
p
−[D(Q̄ + 1) + 1 − Q̄] ± [D(Q̄ + 1) + 1 − Q̄]2 − 4D2 Q̄ (55)
=⇒ P = .
2DQ̄
Recall that g, h > 0 according to their definitions (H is positive-definite), we only take the minus
sign in the above solution, which can be simplified by conducting some tedious algebra as
√ p
4f (0)Λ + 2C + 4Λ2 + 2C 4f (0)2 + 2C
P = √ (56)
2(f (0) − Λ)( 4Λ2 + 2C) − 2Λ
23
Published as a conference paper at ICLR 2025
We can now summarize the solution of self-attention to the zero-th order of ϵs to prove Theorem 3.1
by recovering the subscript s in Eq. (49) with solution of P in Eq. (56):
0 λs 1 1
fs (t) = Λs + − (57)
2 1 + Ps exp(as λs t) 1 + Qs exp(as λs t)
When vs0 (0) = ws0 (0), we have Cs = 0, which implies that F = σ− = 0 according to Eq. (46) and
Eq. (51), and
λs = 2Λs .
As a result, we can rewrite the solution as
Λs Λs
fs0 (t) = Λs − = Λs −fs0 (0)
(58)
1 + A/Be2as Λs t 1+ exp(−2as λs t)
fs0 (0)
where we use the initial condition Eq. (53) in the second equality.
We first present the overall procedure for deriving neural scaling laws in F.1, then discuss them in
detail for fixed context sequence length in F.2 and for varied context length in F.3. We use a ∼ b
to mean that a is approximately equal to b (by neglecting irrelevant coefficients and constants) if
a, b ∈ R.
F.1 P ROCEDURE
For convenience, we first present the test loss described in the main paper
N
1X s
2
L(t) ≈ Pα (S = s) fs0 (t) − Λs . (59)
2 s=1
Note that as as is determined by the dataset and it satisfies the following relation
ψs #s (Λ2s + 1)
as = → Pα (S = s)ψs (Λ2s + 1) = Zs−α ψs (Λ2s + 1) (60)
N
otherwise it would be
λs 1 1
lim fs0 (t) = fs0 (0) = Λs + − , (62)
t→∞ 2 1 + Ps 1 + Qs
which means that self-attention cannot learn the task s if there is no data point for it in the training set
and is similar to the one shot learner property of diagonal linear networks in Nam et al. (2024). These
properties will be repeatedly applied in the following sections. And we now discuss procedures for
deriving different neural scaling laws.
24
Published as a conference paper at ICLR 2025
Size of the model D. To quantify the model size D, we assume that there is a cutoff D such that
the model cannot learn any task strength Λs with s ≥ D. As a result, according to the solution for
the task s in Eq. (57), we have
∀t ≥ 0, s ≥ D : fs0 (t) − Λs = fs0 (0) − Λs . (63)
When D is the bottleneck of training, we let t, N → ∞ (they are sufficient for the training) to derive
scaling laws respect to D. Thus the over all test loss will be
Ns
1X 2
L(D) ∼ Zs−α fs0 (0) − Λs (64)
2
s=D
To derive the neural scaling law of model size D, we only need to find the asymptotic behavior of
Eq. (64) by letting Ns → ∞ and replacing the summation with an integral.
Training time t. Training time t is equivalent to the number of optimization steps. To investigate
the scaling law respect to t, we remove the bottleneck caused by the size of model and the number of
data points by letting N → ∞ (thus we have Eq. (60)) and D = Ns (thus Eq. (63) does not satisfy
for any D ∈ {1, . . . , Ns }). As a result, the overall test loss will be
Z
1 ∞ 2
L(t) ∼ Zs−α fs0 (t) − Λs ds (65)
2 s=1
where we let Ns → ∞ and replace the summation with integral. We sill study the asymptotic
behavior of Eq. (65) with the Laplace method (Bender & Orszag, 1978) to investigate the time
scaling law.
Number of training data points N . When the training is bottlenecked by N , we let t → ∞ (thus
Eq. (61) will be satisfied for all task types s if there exist training data points for them otherwise
Eq. (62) would be satisfied) and the cutoff D = Ns (thus Eq. (63) does not satisfy for any D ∈
{1, . . . , Ns }). As a result, we conclude that the probability of limt→∞ fs0 (t) = fs0 (0) is exactly the
same as the probability that the training data set {Φ(s(n) , X(n) )}N n=1 does not have any training data
point for the task s, i.e, ∀n ∈ {1, . . . , N } : s(n) ̸= s. Therefore, we can rewrite the test loss as
Z
1 ∞ 2
L(N ) ∼ Zs−α fs0 (0) − Λs (1 − P(S = s))N ds
2 s=1
Z (66)
1 ∞ 2
= Zs−α fs0 (0) − Λs (1 − Zs−α )N ds
2 s=1
where, again, we let Ns → ∞ and replace the summation with the integral, and we sill study the
asymptotic behavior of Eq. (65) with the Laplace method to investigate the data scaling law.
Optimal compute C. This is the case when the number of data points is sufficient for the training
(N → ∞), while training time t or the size of model D is the bottleneck given the compute budget
C = Dt such that either t or D scales differently with C. Specifically, if
L(t, D) = at t−αt + aD D−αD ,
then we can rewrite the test loss as
L(D) = at C −αt Dαt + aD D−αD . (67)
To obtain the optimal loss given D, we let
1 − α +α
1
aD αD αt +αD α +α
αt aD αD t D αD
∂D L(D) = 0 =⇒ D = C t D, t= C αt +αD .
at αt at αt
As a result, we can derive the optimal compute budget test loss as
L(C) ∝ C −αt αD /(αt +αD ) (68)
given C = tD, where αt , αD can be obtained from the neural scaling laws for time and model size,
respectively.
25
Published as a conference paper at ICLR 2025
F.2 N EURAL S CALING L AWS WITH F IXED S EQUENCE L ENGTH AND S TRENGTH
In this case, according to Eq. (16), the test loss can be written as
Ns 2
Zλ2 X −α 1 1
L(t) ≈ s − .
8 s=1 1 + P exp [s−α Zλψ(Λ2 + 1)t] 1 + Q exp [s−α Zλψ(Λ2 + 1)t]
(69)
In the following, we will investigate neural scaling laws using the above test loss and the procedures
described in F.1.
Model Scaling Law. According to Eq. (64), the model scaling law can be obtained from studying
the behavior of
Z ∞ 2
2
−α λ 1 1
L(D) ∼ Zs − ds ∝ D−α+1 (70)
s=D 8 1+P 1+Q
where D is the cuttoff of the task such that our model will only learn the first D − 1 tasks and we
let t → ∞.
Time scaling law. Let Ns → ∞ and replace the summation with integral in the test loss, we have
(omitting irrelevant coefficients and we denote ψ̄ = Zλψ(Λ2 + 1) for ease of notation)
Z ∞ 2
1 1 −α
L(t) ∼ −s−α ψ̄t + P
− −s−α ψ̄t + Q
e−2s ψ̄t−α ln s ds (71)
1 e e
Now let
α
F (s) := s−α ψ̄ +
ln s
t
then, applying the Laplace method (Bender & Orszag, 1978), for large t
Z ∞
L(t) ∝ e−F (s)t ds
1
Z c+ε
a ′′ 2
∼ e−(F (c)+F (c)(s−c) )t ds
c−ε
Z c+ε
′′
(c)t(s−c)2
∼ e−F (c)t e−F ds
c−ε
√
b −F (c)t 2π
∼e p (72)
′′
F (c)t
where we expand F (s) around its minimal F (c) in a and b is simply the Gauss integral. We can
solve F ′ (c) = 0 to determine the value of c:
α 1
−αc−α−1 ψ̄ + = 0 =⇒ c = (tψ̄) α . (73)
ct
This further gives us
ln(ψ̄t) 1
F (c) = t−1 + =⇒ e−F (c)t = (74)
t eψ̄t
and
α
F ′′ (c) = α(α + 1)ψ̄c−α−2 − 2
c t
(α+2) α
= α(α + 1)ψ̄(tψ̄)− α − 2 1+ 2
ψ̄ α t α
2 2
= α2 ψ̄ − α t−1− α . (75)
′′
Putting F (c) and F (c) obtained above back to Eq. (72) immediately gives us the time scaling law:
√
2π −1+α−1 α−1
L(t) ∼ 1− 2 t ∝ t− α . (76)
eαψ̄ α
26
Published as a conference paper at ICLR 2025
Data Scaling Law. In this case the bottleneck of training is the number of data while t → ∞ and
D = Ns . According to Eq. (66), the test loss has the form of
Z ∞ 2
λ2 1 1
L(N ) ∼ Zs−α − (1 − Zs−α )N ds
8 1 + P 1 + Q
Z1 ∞
−α
∼ e−N (α ln s/N −ln(1−Zs )) ds. (77)
1
We apply the Laplace method again to study the asymptotic behavior of L(N ) to derive the data
scaling law. Let
ln s
F (s) = α − ln(1 − Zs−α ), (78)
N
then we can expand F (s) around its minimal F (c) where the value of c is determined by F ′ (c) = 0:
α αZs−α−1
F ′ (s) = − (79)
N s 1 − Zs−α
1
=⇒ c = ((N + 1)Z) α (80)
The loss function can be written as
√
2π
L(N ) ∼ e−F (c)N p , (81)
F ′′ (c)N
where
ln((N + 1)Z) 1
F (c) = − ln 1 −
N N +1
∼ N −1 ln(N Z) + N −1 (82)
′′
where we assume that N ≫ 1 in the second line. It is now left for us to find the value of F (c),
which is done as follows.
α(N + 1) αc−2 α Zαc−α−1
F ′′ (s)|s=c = −c−2 + −α
+
N 1 − Zc c (1 − Zc−α )2
Zc−α Zαc−α 1
= αc−2 + −
1 − Zc−α (1 − Zc−α )2 N
2
∼ α(α + 1)Zc−α c−2 ∼ α(α + 1)Z − α N −(α+2)/α
where we use N ≫ 1 again in the last line. Putting F (c) and F ′′ (c) back to Eq. (81) gives us the
scaling law with respect to N :
√
1 2π α−1
L(N ) ∼ 1
p ∝ N− α . (83)
N Ze Z α α(α + 1)N N
− −1−2/α
F.3 N EURAL S CALING L AWS WITH VARIED S EQUENCE L ENGTH AND S TRENGTH
In general, we can derive neural scaling laws with a similar spirit as in the previous section. Ad-
ditionally, we assume for simplicity that Λ2s ≫ 1 such that as ≈ #s ψs Λ2s /N and the model is
initialized as vs0 (0) = ±ws0 (0) and fs0 (0) = O(1) for all s. As a result, the test loss Eq. (17):
Ns 2
ZX ∆ exp(−2as Λs t)
L(t) ∝ s−α−2γ , (84)
2 s=1 1 + ∆ exp(−2as Λs t)
which can be easily derived using Theorem 3.1 and ψs ∝ s−β , Λs ∝ s−γ and we also initialize
the model such that the initial prediction of the model is equally away from the true strength for
different tasks to exclude influence from other aspects, i.e., Λs /fs0 (0) are similar for all s. As we
assume that Λ2s ≫ 1 in Section 4.2, we can rewrite as when N → ∞ as
as ∼ Z Z̄s−α−β−2γ (85)
where we use Z̄ to denote irrelevant normalization constants.
27
Published as a conference paper at ICLR 2025
Model scaling law. According to Eq. (64), the model scaling law can be obtained from studying
the behavior of
Z ∞ 2
∆
L(D) ∼ Zs−α−2γ ds ∝ D−α−2γ+1 (86)
s=D 1+∆
where D is the cuttoff of the task such that our model will only learn the first D − 1 tasks and we
let t → ∞.
Time scaling law. With a similar procedure as in previous section, we will apply the Laplace
method to derive the time scaling law. Specifically,
h i
Z ∞ exp − 4Z̃s−α−β−3γ + α+2γ ln s t
t
L(t) ∝ ds (87)
1 (1 + ∆e−2as Λs t )2
where we use Z̃ to absorb all irrelevant constants. Now we let
α + 2γ
F (s) = 4Z̃s−α−β−3γ + ln s, (88)
t
then the asymptotic behaviors of L(t) can be written as
√
−F (c)t 2π
L(t) ∼ e p (89)
′′
F (c)t
where F ′ (c) = 0 as before. Note that the first derivative of F (s) is
α + 2γ
F ′ (s) = −4Z̃(α + β + 3γ)s−(α+β+3γ+1) + (90)
st
α+β+3γ
1
1 α + β + 3γ
=⇒ c := (M̃ t) α+β+3γ = 4Z̃ t . (91)
α + 2γ
Therefore, we obtain that at s = c
α + 2γ
F (c) = 1 + ln(M̃ t) t−1 (92)
α + β + 3γ
1 α+2γ
=⇒ e−F (c)t = α+2γ t− α+β+3γ . (93)
(eM̃ ) α+β+3γ
Furthermore, the second derivative of F (s) with respect to s is
α + 2γ
F ′′ (s) = M̃ (α + 2γ)(α + β + 3γ + 1)s−(α+β+3γ+2) − , (94)
s2 t
which gives us
2 2
F ′′ (c) = M̃ − α+β+3γ t−1− α+β+3γ (α + 2γ) (α + β + 3γ) . (95)
Putting F (c) and F ′′ (c) back to Eq. (87) gives us the time scaling law
α+2γ−1
L(t) ∝ t− α+β+3γ . (96)
Data Scaling Law. Similar to previous section, in this case the bottleneck of training is N and we
let t → ∞ and D = Ns . According to Eq. (66), the test loss will become
Z ∞
L(N ) ∝ s−α−2γ (1 − Zs−α )N ds
1
Z ∞
−α
= e−N ((α+2γ) ln s/N −ln(1−Zs )) ds. (97)
1
Following a similar procedure, we let
ln s
F (s) = (α + 2γ) − ln(1 − Zs−α ) (98)
N
28
Published as a conference paper at ICLR 2025
For all numerical experiments, we generate the dataset exactly as the process described in Sec-
tion 2.2. The model structure is a linear self-attention as specified in Section 2.3. If not specified,
we set the initialization as
vs (0) = A × 1Ns +1 , ws (0) = vs (0) + 0.1 × A × 1Ns +1 . (106)
where A = 0.1 is a constant and we use 1d ∈ Rd to represent a vector with all elements equal to 1.
For the discrete GD training, we set the learning rate as 10−3 and the number of total optimization
steps as 5000. The theoretical prediction using the solution fs0 (t) is simulated with the forward
Euler method such that t = kη where k is the optimization step and η is the learning rate.
Fig. 3. Ns = 500. We set the context sequence length as ψ = 100, and the task strength Λ = 0.5.
29
Published as a conference paper at ICLR 2025
In this section, we demonstrate the generality of the MSFR problem. Specifically, we show that
our method can also be applied to (or is a limiting case of) other generalized types of the MSFR
setup considered in Section 2.1, namely multitask in-context regression under the source-capacity
condition (Appendix H.1), MSFR with approximately sparse feature (Appendix H.2), and tasks
with idempotent-like Hs (Appendix H.3). We also highlight that our solution can be applied to
study other properties of attention besides the neural scaling laws considered in Section 4.1 and 4.2.
Interestingly, the MSFR problem can be seen as a limiting case of the multitask version of the in-
context regression under the source-capacity condition (Cui et al., 2022), which is defined as follows
and can be seen as a generalization of the setup in Lu et al. (2024).
Multitask in-context regression under the source-capacity condition. Following the settings
of the MSFR problem in Section 2.1, there are Ns different tasks in total. We let S be the ran-
dom variable of picking a specific task among Ns tasks and assume that S follows the power law
distribution Eq. (1). In the following, we will show that each task is constructed as an in-context
regression, thus we term this setting as multitask in-context regression. We do not use the sparse
feature extractor ϕ(s, x) defined in Eq. 2. Instead, following Cui et al. (2022), we use the feature
extractor ϕ̃(s, x) ∈ RNs such that for data x ∼ PX
h i
Σs = Ex∼PX ϕ̃(s, x)ϕ̃(s, x)T = diag (ω̃ s ) = diag (Ps (ω)) (108)
T T
where ω̃ s = ω̃1s ω̃2s · · · ω̃N s
s
∈ RNs and ω = [ω1 ω2 ··· ωNs ] ∈ RNs such that ω
satisfies the source/capacity condition
ωk ∝ k −τ , (109)
and Ps is a simple rearrangement of elements of ω such that
ω̃ss = ω1
s
ω̃s+1 = ω2
..
.
s
ω̃N s
= ωNs −s+1 (110)
ω̃1s = ωNs −s+2
..
.
s
ω̃s−1 = ωNs
i.e., the s-th eigenvalue of Σs is the largest given task type s. Finally, given task type s, we let the
strength for task s be Λs ∈ RNs and the target y ∈ R is
The in-context regression data Φ̃(s, X) is now generated according to the process in Section 2.2:
ϕ̃(s, x(1) ) · · · ϕ̃(s, x(ψs ) ) ϕ̃(s, x̂)
Φ̃(s, X) = (112)
y(s, x(1) ) · · · y(s, x(ψs ) ) 0
while we now assume that the sequence length ψs is fixed for each task s.
30
Published as a conference paper at ICLR 2025
= 1.8
t ( 1)/ , = 1.8
= 2.1
t ( 1)/ , = 2.1
10 2
10 3
L( )
L(t)
D=5
D = 10
D = 15
D = 20
D = 25
D = 30
D = 35
10 3 D = 40
( 1)/( + 1)
Figure 5: Neural scaling laws for softmax self-attention in the multitask in-context regression under
the source-capacity condition. In each figure, we use solid lines to represent empirical simulation
results and dashed lines for power law curves. In (b), we set α = 1.8.
31
Published as a conference paper at ICLR 2025
MSFR with approximately sparse feature. For the MSFR problem in Section 2.1, we now con-
sider a new feature extractor ϕ̃(s, x) such that
ϕ̃(s, x) = ϕ(s, x) + ζ(s, x), (116)
where ζ(s, x) ∈ R can be a random noise to the first order of ϵs (ζ does not need to be sparse).
Ns
We call this task MSFR with approximately sparse feature. This task will give us the same set of
non-linear ODEs to the zero-th order of ϵs under Assumption 3.1 as that for the original MSFR
problem in Section 2.1. Therefore, Theorem 3.1 can still be applied in this case.1 .
Numerical Experiments. In Fig. 6, we let ζ ∼ N (0, ϵ2s I) be a Gaussian noise vector for each
task s. We compare the loss calculated according to fs0 (t) in Theorem 3.1 with that obtained from
empirical simulation. It can be seen that our theoretical prediction is still highly exact with the
existence of the noise vector ζ when the context sequence length ψs is large.
0.030 2.00
1.75
0.025
1.50
1.25
L(t), = 100
L(t), = 10
0.020
1.00
0.015
0.75
0.010 0.50
0.25
0.005
0.00
100 101 102 103
t
Figure 6: Loss L(t) of MSFR with approximately sparse feature for different context sequence
lengths ψ(10 and 100) during training. Solid lines are for theoretical predictions while dashed lines
are for empirical simulations.
From a mathematical perspective, besides the MSFR problem considered in this paper, our strat-
egy for solving the ODEs Eq. (9) can be applied to any cases when the matrix Hs in Eq. (9) is
idempotent-like without Assumption 3.1:
Hs2 = µs Hs (117)
where µs ∈ R is a constant. In such cases, the solution of the model prediction is still fs0 (t) in
Theorem 3.1 except for that we now define as = #s ψs µs /N and fs0 (t) is exact as we do not
need Assumption 3.1. We think it will be an interesting future direction to explore other tasks (for
self-attention or other machine learning models) where Hs has the idempotent-like structure.
Numerical Experiments. To verify the above claim, we consider a simple example where (we
omit the subscript s and consider the case where we only have one type of task)
X3
H= Ei ui uTi , ui · uj = δi,j (118)
i=1
and we let Ei = 2 for i = 1, 2, 3, which will give us µ = 2 in Eq. (117). The learning dynamics
is Eq. (9) with #s = N . In Fig. 7, we compare the loss calculated according to the solution
in Theorem 3.1 with that obtained from empirical simulation. It can be seen that our theoretical
prediction matches with the empirical simulation well because it is an exact solution in this case.
1
We note that our characterization of fs1 (t) in Appendix J is no longer applicable in this case, and we believe
the characterization of fs1 (t) and the generalization of our methods to more complicate feature extractor ϕ̃(s, x)
can be an interesting future direction.
32
Published as a conference paper at ICLR 2025
10 1
10 2
L(t)
10 3
Theory
Simulation
100 101 102 103
t
Figure 7: Loss L(t) for learning dynamics Eq. (9) with idempotent-like H.
In this section, we conduct additional experiments to explore the generality of our conclusion for the
neural scaling laws. In particular, in Appendix I.1, we explore the neural scaling laws of softmax
self-attention for the MSFR problem where we train the model with GD, while in Appendix I.2 we
train the model with AdamW.
I.1 N EURAL S CALING L AWS OF S OFTMAX S ELF -ATTENTION FOR MSFR P ROBLEM
We replace the linear self-attention with the softmax self-attention in the numerical experiments
of Fig. 3 and Fig. 4 to investigate the neural scaling laws for the MSFR problem. For complete-
T
ness, we adopt the WK WQ decomposition
rather than a single merged WKQ , i.e., f (G; θ) =
T T
V G softmax G WK WQ G . All the other settings are the same as those of Section 4.1 and 4.2.
Fixed Context Sequence Length. In Fig. 8, we report the neural scaling laws when the context
sequence length is fixed as in Section 4.1. It can be seen that the scaling laws with respect to time t,
model size D, data size N , and the optimal compute C are similar to those reported in Table 1.
Varied Context Sequence Length. For the varied context sequence length, we let ψs = F(s) ∝
s−β as in Section 4.2 while we keep Λs fixed. We note that the neural scaling laws with respect to
the model size D and data size N are not affected by a varied context sequence length as reflected
in Table 2, which is due to the fact that GD can still learn the task strength Λs for the task s as
t → ∞ when the context sequence length is varied. We report the scaling laws with respect to time
t in Fig. 9, where we can see that the softmax self-attention still admits a similar time scaling law
compared to the linear self-attention for varied context sequence length. As a result, the optimal
compute scaling law of softmax self-attention will also be similar to that of linear self-attention, as
it is a consequence of the time scaling law and model size scaling law and these laws do not change.
These numerical experiments reveal that our claims regarding neural scaling laws for the linear
self-attention can be generalized to the softmax self-attention.
To examine the effects of optimization algorithms on neural scaling laws in the MSFR problem,
T
we train softmax self-attention with AdamW and we also use the WK WQ parameterization. We
focus on the case where the context sequence length and the task strength are fixed. We present our
parameters in the following table.
33
Published as a conference paper at ICLR 2025
= 1.8 = 1.8
10 2 D + 1, = 1.8 t ( 1)/ , = 1.8
= 2.1 = 2.1
D + 1, = 2.1 t ( 1)/ , = 2.1
10 2
L(D)
10 3
L(t)
10 4
10 3
10 2
L(N)
L( ) 10 3 D=5
D = 10
D = 20
D = 25
D = 50
10 3 ( 1)/( + 1)
Figure 8: Neural scaling laws for softmax self-attention trained by GD with different values of
α = 1.8, 2.1 when the context sequence length is fixed. In each figure, we use solid lines to represent
empirical simulation results and dashed lines for power law curves. In (d), we set α = 1.8.
Simulation
t ( 1)/( + )
t ( 1)/ , = 0
100
L(t)
6 × 10 1
4 × 10 1
3 × 10 1
102 103
t
Figure 9: Neural scaling laws with respect to time t for softmax self-attention trained by GD with
α = 1.8 when the context sequence length is fixed. We let N → ∞, D = Ns . Solid lines represent
empirical simulation results while dashed lines represent power law curves obtained from Table 2.
Neural scaling laws with respect to model size D and data size N . We expect that AdamW will
show similar neural scaling laws with respect to the model size D and data size N when compared
to GD. This is because AdamW can still learn the task strength Λs for the task s given sufficient
training time t (Fig. 10c), which is similar to GD. We report the corresponding neural scaling laws
34
Published as a conference paper at ICLR 2025
in Fig. 10a and 10b, where it can be seen that the softmax self-attention trained by AdamW still
admits similar neural scaling laws with respect to D and N .
Neural scaling law with respect to time t. However, AdamW typically exhibits a very different
dynamics during training compared to GD, as AdamW has a very different learning dynamics (e.g.,
it converges faster than GD). Thus we expect that AdamW will lead to a very different time scaling
law (Fig. 10c), which will further lead to a different neural scaling law for the optimal compute
(Fig. 10d). We additionally note that these observations are similar to the observations in Hoff-
mann et al. (2022), where the authors revealed that, when compared to Adam, AdamW shows a
different test loss behavior against the optimization steps (training time), indicating that the type of
optimization algorithm can affect the time scaling laws.
= 1.8 = 1.8
D + 1, = 1.8 N ( 1)/ , = 1.8
= 2.1 = 2.1
10 2 D + 1, = 2.1 N ( 1)/ , = 2.1
10 2
L(D)
L(N)
10 3
10 3
10 4
100 101 102 100 101 102 103
D N
(a) Model Size Law with N, t → ∞ (b) Data Size Law with t → ∞, D = Ns
= 1.8 10 2
t ( 1)/ , = 1.8
= 2.1
t ( 1)/ , = 2.1
10 1
D=5
L( )
L(t)
D=8
D = 10
D = 15
D = 20
10 3
D = 25
D = 30
10 2
D = 40
D = 50
( 1)/( + 1)
Figure 10: Neural scaling laws for softmax self-attention trained by AdamW with different values of
α = 1.8, 2.1. In each figure, we use solid lines to represent empirical simulation results and dashed
lines for power law curves that are obtained Table 1 (when the self-attention is trained by GD). In
(d), we set α = 2.1.
35
Published as a conference paper at ICLR 2025
We first discuss how to derive the solution of model parameters vs0 (t) and ws0 (t) to the zero-th order
of ϵs , then present the solution to the first order of ϵs , which can give us the complete solution
of self-attention up to the first order of ϵs . In the following sections, we omit the subscript s for
convenience and recover it in the final solution.
E.2 gives us the solution of the model output fs0 (t). To obtain the forms of v 0 (t) and w0 (t), we
only need to solve η 0 and ρ0 since
η 0 + ρ0 η 0 − ρ0
v 0 (t) = , w0 (t) =
2 2
according to their definitions. The ODEs of η 0 and ρ0 Eq. (31) can be rewritten using the compo-
nents (note that ηs0 is the s-th component of η 0 ) as:
ψ# 0
η̇s0 = − g − h0 − 4Λ (ηs0 + ΛηN0
s +1
)
4N
0 ψ# 0
η̇N s +1
= −Λ g − h0 − 4Λ (ηs0 + ΛηN 0
s +1
)
4N (119)
ψ# 0
ρ̇0s = g − h0 − 4Λ (ρ0s + Λρ0Ns +1 )
4N
0 ψ# 0
ρ̇Ns +1 = Λ g − h0 − 4Λ (ρ0s + Λρ0Ns +1 ).
4N
An interesting property of these ODEs is that
d
(Ληs0 − ηN
0
s +1
) = 0 =⇒ Ληs0 − ηN
0
s +1
= C̄, Λρ0s − ρ0Ns +1 = C̃, (120)
dt
0
which can gives us a relation between ηs0 and ηN s +1
and a similar one between ρ0s and ρ0Ns +1 . On
the other hand, since we already know the solution of g 0 from E.2 and g 0 can be written as
g 0 = (ηs0 + ΛηN
0
s +1
)2 , (121)
we can solve η 0 and ρ0 based on these relations, which will also give us v 0 and w0 .
We now discuss the solution to the first-order of ϵs . According to the definition of gs1 and h1s in
Eq. (28) and (29), they can be rewritten as
g 1 = ψη 0 · H 1 η 0 + 2m
(122)
h1 = ψρ0 · H 1 ρ0 + 2n
where m and n follow the dynamics Eq. (35) and Eq. (36). Therefore, to obtain the formulations
of g 1 and h1 , the ODEs of m and n will be the only equations that need to be solved since we
can obtain η 0 · H 1 η 0 directly from J.1. In the following, we focus on how to solve m and n. For
convenience, we first present ODEs for m and n ( Eq. (35) and Eq. (36) without the subscript s)
0 0
am ψ 2 # 0 ag 0
ṁ = − g − h − 4Λ + η · H H η − (g 1 − h1 )
0 1 0
,
2 4N 4
(123)
0 0
an ψ 2 # 0 0 1 0 1 1 ah
0
ṅ = g − h − 4Λ + ρ · H H ρ + (g − h ) .
2 4N 4
36
Published as a conference paper at ICLR 2025
The above equations are too complex, thus we attempt to reformulate them to simpler forms: we
can obtain a new set of ODEs from the above equations
d
mh0 = ḣ0 m + ṁh0
dt
#ψ 2 0 ag 0 h0
= −(g 0 − h0 − 4Λ) η · H 0 H 1 η 0 h0 − (g 1 − h1 ) (124)
4N 4
d 0
ng = ġ 0 n + ṅg 0
dt
#ψ 2 0 ag 0 h0
= (g 0 − h0 − 4Λ) ρ · H 0 H 1 ρ0 g 0 + (g 1 − h1 ) , (125)
4N 4
which implies that
d #ψ 2 0
(mh0 + ng 0 ) = −(g 0 − h0 − 4Λ) η · H 0 H 1 η 0 h0 − ρ0 · H 0 H 1 ρ0 g 0 . (126)
dt 4N
The above equation gives us a relation between mh0 and ng 0 . Fortunately, according to the defi-
nitions of g 0 , h0 , H 0 , and H 1 (Eq. (7)), we can expand the terms inside the second bracket of the
above equation:
p
η 0 · H 0 H 1 η 0 h0 − ρ0 · H 0 H 1 ρ0 g 0 = g 0 h0 ηs0 (ρ0s + Λρ0Ns +1 ) − (ηs0 + ΛηN 0
s +1
)ρ0s
√
= 2C(C̄ρ0Ns +1 − C̃ηN 0
s +1
) (127)
where we use Eq. (120) in the second equality. If the model is initialized as C̄ = C̃ = 0, then under
this condition, we can immediately conclude that
d
(mh0 + ng 0 ) = 0 =⇒ ∀t ≥ 0 : mh0 = −ng 0 + Ĉ, (128)
dt
where Ĉ is determined by the initial condition and we let Ĉ = 0 in the following. Noting that
g 1 − h1 := r + 2(m − n) = ψ(η 0 · H 1 η 0 − ρ0 · H 1 ρ0 ) + 2(m − n), (129)
0
Eq. (128) allows us to simplify the ODEs for m and h further by interchangeably using mh and
−ng 0 :
am ng 0 ag 0 r ψ 2 # 0
ṁ = − g 0 − h0 − 4Λ + g 0 − − − g − h0 − 4Λ η 0 · H 0 H 1 η 0
2 m 4 4N
0
ag r ψ # 0 2
= −am g 0 − 2Λ − − g − h0 − 4Λ η 0 · H 0 H 1 η 0 (130)
4 4N
ah0 r ψ 2 # 0
ṅ = −an h0 + 2Λ + + (g − h0 − 4Λ)ρ0 · H 0 H 1 ρ0 . (131)
4 4N
Eq. (130) and (131) are exactly solvable since they are simply first order linear ODEs. Specifically,
let
ϱ(t) = −a g 0 − 2Λ ,
ag 0 r ψ 2 # 0 (132)
ϑ(t) = − − g − h0 − 4Λ η 0 · H 0 H 1 η 0
4 4N
then Eq. (130) can be rewritten as
ṁ = ϱ(t)m + ϑ(t). (133)
The standard procedure for solving this is letting u̇ = −uϱ and multiplying u to both sides of
Eq. (133), then we obtain
Z R
d u(t)ϑ(t)dt + const .
um = uϑ =⇒ um = u(t)ϑ(t)dt + const . =⇒ m = . (134)
dt u
Similarly, to solve n, we let
ε(t) = a h0 + 2Λ
ah0 r ψ 2 # 0
φ(t) = + (g − h0 − 4Λ)ρ0 · H 0 H 1 ρ0 (135)
4 4N
ż = zε(t)
37
Published as a conference paper at ICLR 2025
then
R
z(t)φ(t)dt + const .
n= . (136)
z
As a result, the solution of self-attention up to the first order of ϵs for the task type s under Assump-
tion 3.1 now becomes
g 0 − h0
Solution: f =
4 R R
u(τ )ϑ(τ )dτ + const . z(τ )φ(τ )dτ + const .
+ϵ r+2 − . (137)
u z
We examine each term in Eq. (137) in the following. Since we let C̄ = C̃ = 0 in Eq. (120), g 0 and
h0 can be written explicitly as
g 0 = ψ(Λ2 + 1)2 (ηs0 )2 , h0 = ψ(Λ2 + 1)2 (ρ0s )2 (138)
where
Z Z Z
0 γ̇
− ϱ(t)dt = a(g − 2Λ)dt = −2aΛt + 2 dt
γ
= −2aΛt + 2 ln γ (141)
where we use Eq. (48) in the second line. Putting the above integral back to the expression of u, we
obtain
2
u = (e−aΛt γ)2 = Aeξt + Be−ξt , (142)
√ R
where ξ = 4Λ2 a2 + 2a2 C/2 is defined in Eq. (46). We now derive uϑdt. To start, we examine
each term of ϑ first:
ag 0 r a (g 0 )2 − 2C
− =−
4 4(Λ2 + 1)2
2
ψ # 0 ψ#
− g − h0 − 4Λ η 0 · H 0 H 1 η 0 = − 2
g 0 − h0 − 4Λ g 0
4N 4(Λ + 1)N
(143)
a (g 0 )2 − 2C − 4Λg 0
=−
4(Λ2 + 1)2
a (g 0 )2 − 2C − 2Λg 0
=⇒ ϑ = − .
2(Λ2 + 1)2
Using this in the integral and considering the form of u in Eq. (142), we have
Z Z
a
uϑdt = − 2 2
(Aeξt + Be−ξt )2 (g 0 )2 − 2C − 2Λg 0 dt
2(Λ + 1)
Z Z
2
−ξt 2 aC
=− ξt
Aσ+ e + Bσ− e dt + 2 (Aeξt + Be−ξt )2 dt
a(Λ2 + 1)2 (Λ + 1)2
Z
2Λ
+ 2 2
(Aeξt + Be−ξt ) Aσ+ eξt + Bσ− e−ξt dt (144)
(Λ + 1)
where we frequently use the solution of g 0 in Eq. (33).
38
Published as a conference paper at ICLR 2025
R R
Form of z and
R z(τ )φ(τ )dτ . By the similar procedure of deriving u and uϑdt, we can also
derive z and zφdt.
Z
2
z = exp εdt = (eaΛt θ)2 = Eeξt + F e−ξt (145)
where we use Eq. (48) in the second equality. Similar to the derivation of ϑ, we can derive φ as
follows:
ah0 r a (h0 )2 − 2C
=−
4 4(Λ2 + 1)2
2
ψ # 0 ψ#
g − h0 − 4Λ ρ0 · H 0 H 1 ρ0 = 2
g 0 − h0 − 4Λ h0
4N 4(Λ + 1)N
(146)
a (h0 )2 − 2C + 4Λh0
=−
4(Λ2 + 1)2
a (h0 )2 − 2C + 2Λh0
=⇒ φ = −
2(Λ2 + 1)2
thusZ Z
a
zφdt = − (Eeξt + F e−ξt )2 (h0 )2 − 2C + 2Λh0 dt
2(Λ2 + 1)2
Z Z
2
−ξt 2 aC
=− Eσ − e ξt
+ F σ + e dt + (Eeξt + F e−ξt )2 dt
a(Λ2 + 1)2 (Λ2 + 1)2
Z
2Λ
+ 2 2
(Eeξt + F e−ξt ) Eσ− eξt + F σ+ e−ξt dt, (147)
(Λ + 1)
where we frequently use the solution of h0 in Eq. (33).
Results of integrals. It is now left for us to solve all the integrals to obtain the complete solution.
We list the results below.
1. Z
A2 σ+ 2ξt B 2 σ− −2ξt
(Aeξt + Be−ξt )(σ+ Aeξt + Bσ− e−ξt )dt = e − e + 2ABΛat
2ξ 2ξ
Z
E 2 σ− 2ξt F 2 σ+ −2ξt
(Eeξt + F e−ξt )(Eσ− eξt + F σ+ e−ξt )dt = e − e + 2EF Λat
2ξ 2ξ
2. Z
A2 σ +
2
B 2 σ−
2
(σ+ Aeξt + Bσ− e−ξt )2 dt = e2ξt − e−2ξt + 2σ+ σ− ABt
2ξ 2ξ
Z
E 2 σ−
2
F 2 σ+
2
(σ− Eeξt + σ+ F e−ξt )2 dt = e2ξt − e−2ξt + 2σ+ σ− EF t
2ξ 2ξ
3. Z
A2 2ξt B 2 −2ξt
(Aeξt + Be−ξt )2 dt = e − e + 2ABt
2ξ 2ξ
Z
E 2 2ξt F 2 −2ξt
(Eeξt + F e−ξt )2 dt = e − e + 2EF t.
2ξ 2ξ
Complete solution. With these integrals, we are now ready to find the explicit forms of the solu-
tion Eq. (137). In particular, we have
Z
A2 e2ξt 2 a2 C B 2 e−2ξt 2 a2 C
uϑdt = −σ+ + + aΛσ+ + σ− − − aΛσ−
ξ(Λ2 + 1)2 a 2 ξ(Λ2 + 1)2 a 2
ABt
+ 2 2
4Λ2 a2 + 2a2 C − 4σ+ σ−
(Λ + 1) a
2 2ξt
A e Λ Λa B 2 e−2ξt Λ Λa 4aABt
=− 2 1+ + 2 −1 + + 2 Λ2 + C (148)
(Λ + 1)2 ξ (Λ + 1)2 ξ (Λ + 1)2
39
Published as a conference paper at ICLR 2025
and
Z
E 2 e2ξt 2 a2 C F 2 e−2ξt 2 a2 C
zφdt = −σ − + + aΛσ − + σ + − − aΛσ +
ξ(Λ2 + 1)2 a 2 ξ(Λ2 + 1)2 a 2
EF t
+ 2 2
4Λ2 a2 + 2a2 C − 4σ+ σ−
(Λ + 1) a
E 2 e2ξt Λ Λa F 2 e−2ξt Λ Λa 4aEF t
= 2 2
1 − − 2 2
1 + + 2 2
Λ2 + C . (149)
(Λ + 1) ξ (Λ + 1) ξ (Λ + 1)
These equations are sufficient for us to find m and n. For m we have
R −Q2 e2ξt Λ 1 + Λa + e−2ξt Λ −1 + Λa + 4aQt Λ2 + C
uϑdt ξ ξ (150)
m= =
u (Λ2 + 1)2 (Qeξt + e−ξt )2
where Q has already be determined in Theorem 3.1. For n we have
R P 2 e2ξt Λ 1 − Λa − e−2ξt Λ 1 + Λa + 4aP t Λ2 + C
zφdt ξ ξ (151)
n= =
z (Λ2 + 1)2 (P eξt + e−ξt )2
where P has already be determined in Theorem 3.1. Finally, by using the solved m and n above and
r (Eq. (139)) in Eq. (137) and recovering the subscript s in all relevant terms, we obtain the complete
solution of self-attention up to the first order of ϵ under Assumption 3.1. Note that as t → ∞, we can
easily verify that in Eq. (129) m − n = −2Λ/(Λ2 + 1)2 and r = 4Λ/(Λ2 + 1)2 , thus g 1 − h1 = 0.
As a result, fs0 (t) + ϵs fs1 (t) = Λs as desired.
40