
arXiv:2308.13431v1 [stat.ML] 25 Aug 2023

Six Lectures on
Linearized Neural Networks

Theodor Misiakiewicz¹ and Andrea Montanari²

August 28, 2023

¹ Department of Statistics, Stanford University
² Department of Electrical Engineering and Department of Statistics, Stanford University
Contents

1 Models and motivations
  1.1 Setting
  1.2 The optimization question
  1.3 The linear regime
  1.4 Linearization of two-layer networks
  1.5 Outline of this tutorial

2 Linear regression under feature concentration assumptions
  2.1 Setting and sharp characterization
  2.2 Non-Gaussian covariates
  2.3 An example: Analysis of a latent space model
  2.4 Bounds and benign overfitting

3 Kernel ridge regression in high dimension
  3.1 Infinite-width limit and kernel ridge regression
  3.2 Test error of KRR in the polynomial high-dimensional regime
  3.3 Diagonalization of inner-product kernels on the sphere
  3.4 Proof sketch

4 Random features
  4.1 The random feature model
  4.2 Test error in the polynomial high-dimensional regime
  4.3 Double descent and proportional asymptotics
  4.4 Polynomial asymptotics: Proof sketch

5 Neural tangent features
  5.1 Finite-width neural tangent model
  5.2 Approximation by infinite-width KRR: Kernel matrix
  5.3 Approximation by infinite-width KRR: Test error
  5.4 Proof sketch: Kernel concentration

6 Why stop being lazy (and how)
  6.1 Lazy training fails on ridge functions
  6.2 Non-lazy infinite-width networks (a.k.a. mean field)
  6.3 Learning ridge functions in mean field: basics

A Summary of notations

B Literature survey

References

Acknowledgments
These lecture notes are based on two series of lectures given by Andrea Montanari, first at the “Deep
Learning Theory Summer School” at Princeton from July 27th to August 4th 2021, and then at
the summer school “Statistical Physics & Machine Learning”, which took place at Les Houches
School of Physics in France from 4th to 29th July 2022. Theodor Misiakiewicz was Teaching
Assistant to AM’s lectures at the “Deep Learning Theory Summer School”. We thank Boris Hanin
for organizing the summer school in Princeton, and Florent Krzakala and Lenka Zdeborová from
EPFL for organizing the summer school in Les Houches.
AM was supported by the NSF through award DMS-2031883, the Simons Foundation through
Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF
grant CCF-2006489 and the ONR grant N00014-18-1-2729.

Chapter 1

Models and motivations

This tutorial examines what can be learnt about the behavior of multi-layer neural networks from
the analysis of linear models. While there are important gaps between neural networks and their
linear counterparts, many useful lessons can be learnt by studying the latter.
A few preliminary remarks, before diving into the math:
• We will not assume specific background in machine learning, let alone neural networks. On
the other hand, we will assume some graduate-level mathematics, in particular probability
theory (however, we will refer to the literature for complete proofs.)

• Some of the notations that are used throughout the text will be summarized in Appendix A.

• We will keep bibliographic references in the main text to a minimum. A short guide to the
literature is given in Appendix B.
This chapter is devoted to describing the correspondence between nonlinear and linear models
via the so-called neural tangent model.

1.1 Setting
We will focus on supervised learning. We are given data {(yi , xi )}i≤n ∼iid P where xi ∈ Rd is a
vector of covariates, yi ∈ R is a response or label, and P a probability distribution on R × Rd . We
denote the space of such probability distributions by P(R × Rd ).
We want to learn a model, that is, a function f : Rd → R that, given a new covariate vector
xnew, allows us to predict the corresponding response ynew. The quality of a prediction is measured by
the test error R(f; P) = E[ℓ(ynew, f(xnew))], where ℓ : R × R → R is a loss function. We will focus
on the simplest choice for the latter, the square loss. Namely,

    R(f; P) = E[(ynew − f(xnew))²] ,    (ynew, xnew) ∼ P .    (1.1.1)

Depending on the context, we will modify the notation for the arguments of R. In particular,
we will be mostly interested in parametric models f (x) = f (x; θ), where θ ∈ Rp is a vector of
parameters (e.g. the network weights), and therefore the test error can be thought of as a function
of these parameters. With a slight abuse of notation, we might therefore write

    R(θ; P) = E[(ynew − f(xnew; θ))²] .    (1.1.2)

Often we will drop the argument P or replace it by a proxy. Also, it is sometimes convenient to
subtract from the test error the minimum error achieved by any predictor f (also known as the
‘Bayes risk’). The result is the so-called ‘excess risk’

    Rexc(f; P) = R(f; P) − RB(P)                                                          (1.1.3)
               := E[(ynew − f(xnew))²] − inf_{f0 : Rd → R} E[(ynew − f0(xnew))²] ,

where the infimum is taken over all measurable functions. In the case of square loss treated here,
the Bayes risk is just the conditional variance: RB (P) = E{(y − E(y|x))2 }.
The main approach to learn the parametric model f ( · ; θ) consists in minimizing the empirical
risk
    R̂n(θ) := (1/n) ∑_{i=1}^n (yi − f(xi; θ))² .    (1.1.4)

Modern machine learning systems attempt to achieve approximate minimization of this objective
via first order methods. This term is used to refer to algorithms that access the cost function
R̂n (θ) only by obtaining its gradient ∇R̂n (θ i ) and value R̂n (θ i ) at query points θ 0 , . . . , θ k . The
next query point θ k+1 is computed from this information.
The simplest example is, of course, gradient descent (GD):

    θ_{k+1} = θ_k − ε_k S ∇θ R̂n(θ_k) .

Here S ∈ R^{p×p} is a scaling matrix that allows us to choose different step sizes for different groups
of parameters. In practice, stochastic gradient descent (SGD) is preferred for a number of reasons.
In its simplest implementation, SGD takes a gradient step with respect to the loss incurred on a
single, randomly chosen sample:

    θ_{k+1} = θ_k + 2 ε̃_k ( y_{I(k)} − f(x_{I(k)}; θ_k) ) S ∇θ f(x_{I(k)}; θ_k) .
Both GD and SGD can sometimes be well approximated by gradient flow (GF) for the sake of
analysis. GF corresponds to the vanishing stepsize limit, and operates in continuous time

θ̇(t) = −S∇R̂n (θ(t)) .

Extra care should be paid —in general— when working with such continuous time dynamics, as
they do not necessarily correspond to practical algorithms. However in the case of GD and SGD
the correspondence is relatively straightforward: these algorithms are often well approximated by
GF for reasonable choices of the stepsize. (Namely, stepsize that is an inverse polynomial in the
dimension d.)
For future reference, it is useful to note that the empirical risk (1.1.4) only depends on the
model f (x; θ) through its evaluation at the n datapoints:
    fn(θ) = ( f(x1; θ), f(x2; θ), . . . , f(xn; θ) )^T    (1.1.5)

This defines the evaluation map fn : Rp → Rn. The empirical risk can then be written as

    R̂n(θ) := (1/n) ∥ y − fn(θ) ∥₂² ,    (1.1.6)

with y = (y1, . . . , yn)^T.

[Figure 1.1: cartoon of the set of empirical risk minimizers ERM0 := {θ : R̂n(θ) = 0} in parameter space Rp, together with the initialization θ0 and a nearby minimizer θ̂.]
These lectures are mainly devoted to two-layer (one-hidden layer) networks. In this case the
parametric model reads
    f(x; θ) = α ∑_{i=1}^N ai σ(⟨wi, x⟩) ,    θ = (a1, . . . , aN; w1, . . . , wN) ∈ R^{N(d+1)} ,    (1.1.7)

where the activation function σ : R → R is fixed. (The scaling factor α will be useful in the
following.)
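To make these definitions concrete, here is a minimal numpy sketch (not part of the original notes) of the two-layer model (1.1.7) trained by plain gradient descent on the empirical risk (1.1.4); the scaling matrix is taken to be S = I, the activation tanh, and the data are synthetic placeholders.

```python
import numpy as np

# Illustrative sketch: two-layer network of Eq. (1.1.7) and gradient descent on the
# empirical risk (1.1.4), with S = identity and placeholder data.
rng = np.random.default_rng(0)
n, d, N = 100, 20, 200
alpha, lr = 1.0 / np.sqrt(N), 0.1

X = rng.standard_normal((n, d)) / np.sqrt(d)      # covariates x_i
y = rng.standard_normal(n)                        # responses y_i (placeholder data)
a = rng.standard_normal(N)                        # second-layer weights a_i
W = rng.standard_normal((N, d)) / np.sqrt(d)      # first-layer weights w_i

sigma = np.tanh                                   # activation; sigma' = 1 - tanh^2

def predict(X, a, W):
    return alpha * sigma(X @ W.T) @ a             # f(x_i; theta) for all i

for _ in range(500):                              # GD step: theta <- theta - lr * grad R_n
    H = sigma(X @ W.T)                            # n x N hidden activations
    r = predict(X, a, W) - y                      # residuals f(x_i; theta) - y_i
    grad_a = (2 * alpha / n) * H.T @ r
    grad_W = (2 * alpha / n) * ((1 - H**2) * np.outer(r, a)).T @ X
    a -= lr * grad_a
    W -= lr * grad_W

print("empirical risk:", np.mean((predict(X, a, W) - y) ** 2))
```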

1.2 The optimization question


For nonlinear models such as the two-layer network (1.1.7), the empirical risk R̂n (θ) of Eq. (1.1.4)
is highly non-convex. Despite this, GD or SGD (and their variants) appear to be able to optimize
the empirical risk of real neural networks to near global optimality.
This leads to the following

Optimization question: How is it possible that simple first order methods optimize the
empirical risk of neural networks to near global optimality?

Over the last few years, a hypothesis has emerged to explain this puzzle: tractability of
empirical risk minimization is due to the fact that the network is overparametrized. Let us briefly
and informally describe this scenario, which is essentially conjectural at the moment. Since the number
of parameters p is larger than the sample size n, we expect there to be many global empirical risk
minimizers that achieve zero risk. We denote the set of such global minimizers by
    ERM0 = { θ ∈ Rp : fn(θ) = y } .    (1.2.1)

These form a sub-manifold in Rp , see Fig. 1.1 for a cartoon. Points of this manifold correspond to
models that perfectly interpolate the data, i.e. ‘interpolators’.

For most initializations θ0, the manifold of interpolators ERM0 passes close to θ0. GD or SGD
converge quickly to a specific point on ERM0 which is close to θ0 in a suitable sense.
As emphasized, this scenario is a hypothesis, and we do not know precise conditions under
which it holds. In fact, it is easy to construct counter-examples. For instance, suppose the function f( · ; θ)
depends on the parameter θ only through its first k coordinates θ1, . . . , θk. Then there might not¹
be a solution to the n equations y = fn(θ) for k < n (despite n ≤ p).
One setting in which the conditions for interpolation are better understood is the ‘linear’ regime
which we discuss next.

1.3 The linear regime


In certain cases, it can be proved that the weights do not change much during training, and it is
therefore accurate to approximate f ( · ; θ) by its first order Taylor expansion in θ 0 . Justifying this
approximation is not the main focus of these lectures which instead take it as a starting point and
derive some of its consequences.
We will outline nevertheless an explanation, deferring to the literature for a rigorous treatment
(see [DZPS18, COB19, BMR21] and Appendix B). We are looking for a solution to the interpolating
equation fn (θ) = y. By Taylor’s theorem, this yields
    y − fn(θ0) = fn(θ) − fn(θ0)                                                           (1.3.1)
               = Dfn(θ0)(θ − θ0) + ∫₀¹ ( Dfn(θt) − Dfn(θ0) )(θ − θ0) dt                   (1.3.2)
               =: Φ(θ − θ0) + e(θ) ,                                                      (1.3.3)

where θt := (1 − t)θ0 + tθ and we denoted by Φ := Dfn(θ0) ∈ R^{n×p} the Jacobian of the evaluation
map at the initialization θ0. Further,

    ∥e(θ)∥2 ≤ Ln ∥θ − θ0∥₂² ,                                                             (1.3.4)
    Ln := sup_{θ ≠ θ0} ∥Dfn(θ) − Dfn(θ0)∥op / ∥θ − θ0∥2 .                                 (1.3.5)
Neglecting the second order contribution in the above equation suggests a solution of the form (here
Φ+ denotes the pseudoinverse)
θ = θ 0 + Φ+ (y − fn (θ 0 ) + δ) , (1.3.6)
with the following equation for δ (writing ỹ := y − fn (θ 0 )):
δ = −e(θ 0 + Φ+ (ỹ + δ)) . (1.3.7)

Defining F(δ) := −e(θ0 + Φ+(ỹ + δ)), by Eq. (1.3.4) we have ∥F(δ)∥2 ≤ Ln (∥Φ+ ỹ∥2 + ∥Φ+∥op ∥δ∥2)².
In other words, F maps the ball of radius r into the ball of radius Ln (∥Φ+ ỹ∥2 + ∥Φ+∥op r)². By
taking r = ∥Φ+ ỹ∥2 / ∥Φ+∥op we obtain that, for

    Ln ≤ 1 / ( 4 ∥Φ+(y − fn(θ0))∥2 ∥Φ+∥op ) ,                                             (1.3.8)

the ball of radius r is mapped into itself. Hence, by Brouwer's fixed point theorem, there exists a
solution (an interpolator) of the form (1.3.6) with ∥δ∥2 < r.

¹ The situation is a bit more complicated: if all we are interested in is interpolation, then fewer than n parameters
can be sufficient for certain parametric classes f( · ; θ). However, these solutions are typically fragile to noise and
hard to compute.
Summarizing, under condition (1.3.8), there exists an interpolator that is well approximated by
replacing the nonlinear model f ( · ; θ) by its linearization

flin (x; θ) := f (x; θ 0 ) + ⟨θ − θ 0 , ∇θ f (x; θ 0 )⟩ . (1.3.9)

Indeed, for δ = 0, the expression in Eq. (1.3.6) coincides with the solution of the linearized equations

flin (xi ; θ) = yi , ∀i ≤ n , (1.3.10)

which minimizes the ℓ2 distance from initialization.


This suggests defining the following linearized empirical risk

    R̂lin,n(θ) := (1/n) ∥ y − flin,n(θ) ∥₂²                                                (1.3.11)
              = (1/n) ∥ ỹ − Φ(θ − θ0) ∥₂² ,                                               (1.3.12)

and the corresponding linearized test error:

    Rlin(θ) := E[ ( ynew − flin(xnew; θ) )² ] .                                            (1.3.13)

Several papers prove that, under conditions analogous to (1.3.8), the original problem and the
linearized one are close to each other, see Appendix B. Informally, denoting by θ(t) the gradient
flow for R̂n , and by θ lin (t) the gradient flow for R̂lin,n , these proofs imply the following:

1. The empirical risk converges exponentially fast to 0. Namely, for all t, we have R̂n(θ(t)) ≤
R̂n(θ0) exp(−λ0 t/2), where λ0 := ∥Φ+∥op⁻² = σmin(Φ)².

2. The parameters of one model are tracked by the ones of the other model, i.e. ∥θ(t)−θ lin (t)∥2 ≪
∥θ(t) − θ 0 ∥2 for all t.

3. The linearized model is a good approximation of the fully nonlinear model. Namely, for all t,

       ∥ flin( · ; θlin(t)) − f( · ; θ(t)) ∥L2 ≪ ∥ f( · ; θ(t)) ∥L2 .                      (1.3.14)

   (Here ∥g∥L2 := E{g(x)²}^{1/2}.)

From a statistical perspective, the most important point is the last one, cf. Eq. (1.3.14), since it
says that the model learnt within the linear theory is a good approximation of the original nonlinear
model at a random test point, rather than at a training point. In particular, by the triangle inequality,

    | R(θ(t))^{1/2} − Rlin(θlin(t))^{1/2} | ≪ E{f(x; θ(t))²}^{1/2} .                       (1.3.15)

From here on, we will focus on such linearized models, and try to understand their generalization
error. Notice that the parameter vector θ lin (t) depends on time and hence in principle we would
like to study the whole function t ↦ Rlin(θlin(t)). Of particular interest is the limit t → ∞ (which
corresponds to interpolation). It is an elementary fact that gradient flow with a quadratic cost
function converges to the minimizer that is closest to the initialization in the ∥ · ∥_{S⁻¹} norm:

    lim_{t→∞} θlin(t) = argmin{ ∥θ − θ0∥_{S⁻¹} : Dfn(θ0)(θ − θ0) = y − fn(θ0) }           (1.3.16)
                      = θ0 + argmin{ ∥b∥_{S⁻¹} : Dfn(θ0) b = y − fn(θ0) } ,               (1.3.17)

where ∥b∥_{S⁻¹} = ∥S^{−1/2} b∥2 = ⟨b, S⁻¹ b⟩^{1/2}.


Hereafter, we will typically drop the subscripts ‘lin’. Rather than studying the gradient flow
path, we will focus on a different path that also interpolates between θ0 and the min-ℓ2-distance
interpolator. Namely, we will consider ridge regression

    b̂(λ) := argmin_{b∈Rp} { (1/n) ∥ ỹ − Φb ∥₂² + λ ∥b∥²_{S⁻¹} } ,                        (1.3.18)
where we recall that ỹ = y − fn (θ 0 ) and Φ = Dfn (θ 0 ).
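The following is an illustrative numpy sketch (not from the notes) of the weighted ridge problem (1.3.18): the Jacobian Φ = Dfn(θ0) is obtained by finite differences for a placeholder model, and b̂(λ) is computed from the normal equations; all model and data choices are assumptions made for the sake of the example.

```python
import numpy as np

# Minimal sketch of Eq. (1.3.18): Jacobian of the evaluation map by central finite
# differences, then closed-form solution of the weighted ridge problem.
rng = np.random.default_rng(1)
n, d, lam = 50, 20, 1e-3

def f_model(theta, X):
    # placeholder nonlinear model; any weakly differentiable f(x; theta) would do
    return theta[-1] * np.tanh(X @ theta[:-1])

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta0 = 0.1 * rng.standard_normal(d + 1)
S = np.eye(d + 1)                                   # stepsize scaling matrix (identity here)

eps = 1e-5
Phi = np.zeros((n, theta0.size))                    # Jacobian Df_n(theta_0), in R^{n x p}
for j in range(theta0.size):
    e = np.zeros_like(theta0); e[j] = eps
    Phi[:, j] = (f_model(theta0 + e, X) - f_model(theta0 - e, X)) / (2 * eps)

ytilde = y - f_model(theta0, X)                     # residuals at initialization
b_hat = np.linalg.solve(Phi.T @ Phi / n + lam * np.linalg.inv(S), Phi.T @ ytilde / n)
print("linearized train residual:", np.linalg.norm(ytilde - Phi @ b_hat))
```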

1.4 Linearization of two-layer networks


As mentioned above, we will focus on the case of two-layer neural networks, cf. Eq. (1.1.7). We
will assume σ to be weakly differentiable with weak derivative σ ′ . A simple calculation yields
    flin(x; θ0 + b) = f(x; θ0) + α ∑_{i=1}^N b1,i σ(⟨wi, x⟩) + α ∑_{i=1}^N ai ⟨b2,i, x⟩ σ′(⟨wi, x⟩) ,   (1.4.1)
    θ0 := (a1, . . . , aN; w1, . . . , wN) ∈ R^{N(d+1)} ,                                               (1.4.2)
    b := (b1,1, . . . , b1,N; b2,1, . . . , b2,N) ∈ R^{N(d+1)} .                                        (1.4.3)

We can rewrite this linear model in terms of the following featurization maps:

    ϕRF(x) = (1/√N) [σ(⟨w1, x⟩); . . . ; σ(⟨wN, x⟩)] ,                                     (1.4.4)
    ϕNT(x) = (1/√(Nd)) [σ′(⟨w1, x⟩) xT; . . . ; σ′(⟨wN, x⟩) xT] .                          (1.4.5)

We have ϕRF : Rd → RN and ϕNT : Rd → R^{Nd}. Setting for convenience α = 1/√N, we get

    flin(x; θ0 + b) = f(x; θ0) + ⟨b1, ϕRF(x)⟩ + √d ⟨b2, ϕNT(x)⟩ .                          (1.4.6)

We define the design matrix

    Φ :=  [ —ϕRF(x1)—  —ϕNT(x1)— ]
          [ —ϕRF(x2)—  —ϕNT(x2)— ]
          [     ⋮           ⋮    ]                                                         (1.4.7)
          [ —ϕRF(xn)—  —ϕNT(xn)— ]

(each row concatenates the RF and NT features of one data point), and consider the stepsize scaling matrix

    S = diag( 1, . . . , 1 ; sd, . . . , sd ) ,                                            (1.4.8)

with N ones followed by Nd entries equal to sd.

With a suitable redefinition of b = (b1, b2) ∈ RN × R^{Nd}, the ridge regression problem thus reads

    b̂(λ) := argmin_{b∈Rp} { (1/n) ∥ ỹ − Φb ∥₂² + λRF ∥b1∥₂² + λNT ∥b2∥₂² } ,             (1.4.9)
where λRF := λ, λNT := λ/s. We conclude this section with two remarks.

Remark 1.4.1. We saw that GD with respect to a quadratic cost function converges to the closest
minimizer to the initialization, where ‘closest’ is measured by a suitably weighted ℓ2 distance. This
is the simplest example of a more general phenomenon known as ‘implicit regularization’: when
learning overparametrized models, common optimization algorithms select empirical risk minimizers
that present special simplicity properties (smallness of certain norms). While this is normally
achieved by explicitly regularizing the risk to promote those properties, in overparametrized systems
it is implicitly induced (to some extent) by the dynamics of the optimization algorithm. Examples
of this phenomenon include [GWB+ 17, GLSS18a, SHN+ 18, JT18b] (see also Appendix B).
Note that Eq. (1.4.9) also illustrates how the precise form of implicit regularization depends on
the optimization algorithm. We mentioned that gradient flow converges to the λ = 0+ solution
of Eq. (1.4.9), which corresponds to the interpolator that minimizes ∥b1 ∥22 + ∥b2 ∥22 /s. The precise
form of the norm that is implicitly regularized depends on the details of the optimization algorithm
(in this case the ratio of learning rates s).

Remark 1.4.2. The ridge regression problem of Eq. (1.4.9) presents another peculiarity. The
responses y have been replaced by the residuals at initialization ỹ = y − fn (θ 0 ). In general, this
fact should be taken into account when analyzing the linear system. However, it is possible to
construct initializations close to standard random initializations and yet have fn (θ 0 ) = 0. We will
therefore neglect the difference between y and ỹ.
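Before moving on, here is a minimal sketch (not part of the notes) of how the featurization maps (1.4.4)-(1.4.5), the design matrix (1.4.7), and the ridge problem (1.4.9) can be assembled in numpy; the activation, data, and regularization values are placeholders.

```python
import numpy as np

# Illustrative sketch of the RF/NT featurization maps and of ridge regression (1.4.9).
rng = np.random.default_rng(2)
n, d, N = 200, 30, 100
lam_RF, lam_NT = 1e-3, 1e-3                     # lambda_RF, lambda_NT in Eq. (1.4.9)

X = rng.standard_normal((n, d)) / np.sqrt(d)    # rows are the x_i
y = rng.standard_normal(n)                      # placeholder responses (f(x; theta_0) = 0 assumed)
W = rng.standard_normal((N, d)) / np.sqrt(d)    # first-layer weights w_i

sigma = np.tanh
dsigma = lambda t: 1.0 - np.tanh(t) ** 2        # weak derivative sigma'

Z = X @ W.T                                     # pre-activations ⟨w_i, x_j⟩, shape n x N
Phi_RF = sigma(Z) / np.sqrt(N)                                                         # Eq. (1.4.4)
Phi_NT = (dsigma(Z)[:, :, None] * X[:, None, :]).reshape(n, N * d) / np.sqrt(N * d)    # Eq. (1.4.5)

Phi = np.hstack([Phi_RF, Phi_NT])               # design matrix (1.4.7)
reg = np.concatenate([np.full(N, lam_RF), np.full(N * d, lam_NT)])
b_hat = np.linalg.solve(Phi.T @ Phi / n + np.diag(reg), Phi.T @ y / n)   # solves (1.4.9)
b1, b2 = b_hat[:N], b_hat[N:]
```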

1.5 Outline of this tutorial


The rest of this tutorial is organized as follows.

Chapter 2 studies ridge regression under a simpler model in which the feature vectors ϕ(x) ∈ Rp
are completely characterized by their mean (which we will assume to be zero) and covariance
Σ. Under suitable concentration assumptions on the feature vectors, the resulting risk is
universal and can be precisely characterized. Despite its simplicity, this model displays many
interesting phenomena.
In fact, while the more complex settings in the following chapters do not fit the required
concentration assumptions, their behavior is correctly predicted by this simpler model, pointing
at a remarkable universality phenomenon.

Chapter 3 focuses on the infinite width (N → ∞) limit of the neural tangent model we derived
above. This is described by kernel ridge regression for a rotationally invariant kernel. We
derive the generalization behavior of KRR in high dimension.

Chapter 4 studies ‘random feature models,’ which correspond to setting b2 = 0 in (1.4.9), i.e.
fitting only the second-layer weights. This clarifies what happens when moving from an
infinitely wide network to finite width.

Chapter 5 considers the other limit case in which we set b1 = 0 and only learn the first layer
weights (in the linear approximation).
It is not hard to see that the case in which both b1 , b2 are fit to data is very close to the one
in which b1 = 0 and b2 is fit to data. Therefore this case yields the correct insights into the
generalization behavior of finite width neural tangent models.

Chapter 6 finally discusses the limitations of linear theory. In particular, we discuss some simple
examples in which we need to go beyond the neural tangent theory to capture the behavior
of actual neural networks trained via gradient descent.

Chapter 2

Linear regression under feature concentration assumptions

In this section we consider ridge regression and its limit for vanishing regularization (minimum ℓ2 -
norm regression), focusing on the overparametrized regime p > n. Our objective is to understand
if and when interpolation or overfitting is compatible with good generalization. With this in mind,
we start by considering a simple model in which the feature vectors are completely characterized by
their covariance Σ. With little loss of generality, we will assume the vectors to be centered. Under
certain concentration assumptions on these vectors, a sharp characterization of the prediction
risk can be derived. This theoretical prediction captures in a precise way numerous interesting
phenomena such as benign overfitting and double descent.

2.1 Setting and sharp characterization


In this chapter we assume to be given i.i.d. samples {(yi, zi)}i≤n where responses are given by

    yi = ⟨β∗, zi⟩ + wi ,    E(wi | zi) = 0 ,    E(wi² | zi) = τ² .                         (2.1.1)

We assume E(zi) = 0, E(zi ziT) = Σ. (The case of non-zero mean E(zi) ≠ 0 could be treated at the cost
of some notational burden, but does not present conceptual novelties.)
While we will state results that hold for a broad class of non-Gaussian vectors, for pedagogical
reasons we will begin with the simplest example:
z i ∼ N(0, Σ) ⊥ wi ∼ N(0, τ 2 ) . (2.1.2)
The covariance Σ ∈ Rp×p and coefficient vector β ∗ ∈ Rp are unknown, alongside the noise level
τ . We estimate β ∗ using ridge regression:
    β̂(λ) := argmin_{b∈Rp} { (1/n) ∥ y − Zb ∥₂² + λ ∥b∥₂² } ,                             (2.1.3)
where Z ∈ Rn×p is the matrix whose i-th row is z i and y ∈ Rn is the vector whose i-th entry is yi .
Remark 2.1.1. We emphasize that the above definitions make perfect sense if p = ∞. In this
case Rp is interpreted to be the Hilbert space of square summable sequences ℓ2 and ⟨u, v⟩ = uT v
is the scalar product in ℓ2 . In fact, unless specified otherwise, the results below cover this infinite-
dimensional case.

We are interested in the excess test error. With a slight abuse of notation, we can use the
notation Rexc(fˆ) = Rexc(λ; Z, β∗, w), since fˆ(z) = ⟨β̂(λ), z⟩ is a function of λ, Z, β∗, w. We have:

    Rexc(λ; Z, β∗, w) := Eznew[ ( ⟨β̂(λ), znew⟩ − ⟨β∗, znew⟩ )² ]                           (2.1.4)
                       = ∥ β̂(λ) − β∗ ∥²Σ .                                                (2.1.5)

Recall that in Chapter 1, Eq. (1.1.3), we defined the excess test error to be the difference between
test error and Bayes error. In this case the Bayes error is equal to τ 2 , and therefore the relation is
particularly simple: R(f) = Rexc(f) + τ² (reverting to using f to denote the argument).
We also note that the ridge regression estimator can be written explicitly as

    β̂(λ) = (1/n) Sλ ZT y ,    Sλ = ( (1/n) ZT Z + λ Ip )⁻¹ ,                              (2.1.6)

whence we obtain the expression:

    Rexc(λ; Z, β∗, w) = ∥ λ Sλ β∗ − (1/n) Sλ ZT w ∥²Σ .                                    (2.1.7)

Note that the risk (2.1.5) depends on the noise vector w. It is useful to define its expectation with
respect to w, which can be exactly decomposed in a bias term and a variance term:

    R̄(λ; Z, β∗) := Ew[ Rexc(λ; Z, β∗, w) ]                                                (2.1.8)
                 = B(λ; Z, β∗) + V(λ; Z) .                                                 (2.1.9)

The bias and variance are given by

    B(λ; Z, β∗) := λ² ⟨β∗, Sλ Σ Sλ β∗⟩ ,                                                   (2.1.10)
    V(λ; Z) := (τ²/n) Tr( Sλ² (1/n) ZT Z Σ ) .                                             (2.1.11)
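As an illustration (not from the notes), the following numpy sketch draws Gaussian data as in (2.1.2), computes the ridge estimator (2.1.6), and evaluates the conditional bias and variance (2.1.10)-(2.1.11) for a placeholder diagonal covariance.

```python
import numpy as np

# Illustrative sketch: ridge regression (2.1.3) on Gaussian data (2.1.2) and the exact
# bias/variance formulas (2.1.10)-(2.1.11) given the realized design Z.
rng = np.random.default_rng(3)
n, p, lam, tau = 300, 600, 0.1, 0.5

sigma_eigs = 1.0 / np.arange(1, p + 1)          # simple decaying spectrum for Sigma (placeholder)
Sigma = np.diag(sigma_eigs)
beta_star = rng.standard_normal(p) / np.sqrt(p)

Z = rng.standard_normal((n, p)) * np.sqrt(sigma_eigs)   # rows ~ N(0, Sigma), Sigma diagonal
w = tau * rng.standard_normal(n)
y = Z @ beta_star + w

S_lam = np.linalg.inv(Z.T @ Z / n + lam * np.eye(p))    # S_lambda in Eq. (2.1.6)
beta_hat = S_lam @ Z.T @ y / n

B = lam**2 * beta_star @ S_lam @ Sigma @ S_lam @ beta_star            # Eq. (2.1.10)
V = (tau**2 / n) * np.trace(S_lam @ S_lam @ (Z.T @ Z / n) @ Sigma)    # Eq. (2.1.11)
print("bias:", B, "variance:", V, "realized excess risk:",
      (beta_hat - beta_star) @ Sigma @ (beta_hat - beta_star))
```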
Of course, these formulas do not always provide simple insights into the qualitative behavior
of the test error. In particular, B(λ; Z, β ∗ ) and V(λ; Z) are random quantities because of the
randomness in Z. The next theorem shows that these quantities concentrate around deterministic
predictions that depend uniquely on the geometry of (Σ, β ∗ ).
Before stating this characterization, we introduce some important notions.

Definition 2.1.1 (Effective dimension). We say that Σ has effective dimension dΣ (n) if, for all
1 ≤ k ≤ min{n, p},
    ∑_{ℓ=k}^p σℓ ≤ dΣ σk .

Without loss of generality, we will always choose dΣ (n) ≥ n.

Definition 2.1.2 (Bounded varying spectrum). We say that Σ has bounded varying spectrum if
there exists a monotone decreasing function ψ : (0, 1] → [1, ∞) with limδ↓0 ψ(δ) = ∞, such that
σ⌊δi⌋ /σi ≤ ψ(δ) for all δ ∈ (0, 1], i ∈ N and δi ≥ 1.

Given Σ, λ, let λ∗ (λ) ≥ 0 be the unique positive solution of
    n ( 1 − λ/λ∗ ) = Tr( Σ (Σ + λ∗ I)⁻¹ ) .                                                (2.1.12)

(with λ∗ = 0 if λ = 0 and n ≥ p.) Define B(Σ, β∗) and V(Σ) by

    B(Σ, β∗) := λ∗² ⟨β∗, (Σ + λ∗ I)⁻² Σ β∗⟩ / ( 1 − n⁻¹ Tr( Σ² (Σ + λ∗ I)⁻² ) ) ,          (2.1.13)
    V(Σ) := τ² Tr( Σ² (Σ + λ∗ I)⁻² ) / ( n − Tr( Σ² (Σ + λ∗ I)⁻² ) ) .                     (2.1.14)

Let us emphasize that B(Σ, β ∗ ), V (Σ) are deterministic quantities.
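They are also straightforward to compute numerically: the sketch below (illustrative; spectrum, signal, and noise level are placeholders) solves the fixed-point equation (2.1.12) for λ∗ by bisection and evaluates B(Σ, β∗) and V(Σ) from (2.1.13)-(2.1.14), assuming Σ and β∗ are given in the eigenbasis of Σ.

```python
import numpy as np

# Illustrative sketch: solve (2.1.12) for lambda_* and evaluate (2.1.13)-(2.1.14).
def lambda_star(sigma_eigs, n, lam, tol=1e-12):
    # F(x) = n (1 - lam/x) - Tr(Sigma (Sigma + x I)^{-1}) is increasing in x for lam > 0
    f = lambda x: n * (1.0 - lam / x) - np.sum(sigma_eigs / (sigma_eigs + x))
    lo, hi = 1e-12, 1.0
    while f(hi) < 0:          # grow the bracket until f changes sign
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def predicted_bias_variance(sigma_eigs, beta_star, n, lam, tau):
    ls = lambda_star(sigma_eigs, n, lam)
    t2 = np.sum(sigma_eigs**2 / (sigma_eigs + ls) ** 2)       # Tr(Sigma^2 (Sigma + ls I)^{-2})
    B = ls**2 * np.sum(sigma_eigs * beta_star**2 / (sigma_eigs + ls) ** 2) / (1.0 - t2 / n)
    V = tau**2 * t2 / (n - t2)
    return B, V

p, n = 2000, 300
sigma_eigs = 1.0 / np.arange(1, p + 1) ** 2     # placeholder spectrum
beta_star = np.ones(p) / np.sqrt(p)             # placeholder signal (in the eigenbasis)
print(predicted_bias_variance(sigma_eigs, beta_star, n, lam=1e-3, tau=0.5))
```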


The next theorem gives sufficient conditions under which B(Σ, β∗), V(Σ) are accurate multiplicative
approximations of the actual bias and variance.

Theorem 1. Assume ∥Σ∥op = 1, a setting which we can always reduce to by rescaling Z. Further
assume ∥Σ−1/2 β ∗ ∥ < ∞, and that one of the following scenarios holds:

1. Proportional regime. There exists a constant M < ∞ such that p/n ∈ [1/M, M ], σp ≥ 1/M ,
λ ∈ [1/M, M ].

2. Dimension-free regime. Σ has effective dimension dΣ(n), bounded varying spectrum, and
   there exist constants M > 0 and γ ∈ (0, 1/3) such that the following hold. Define

       ρ(λ) := ⟨β∗, Σ(λ∗ I + Σ)⁻¹ β∗⟩ / ( ∥β∗∥² Tr( Σ(λ∗ I + Σ)⁻¹ ) ) .                    (2.1.15)

Then we have λ∗ (0) ≤ M , λ ∈ [λ∗ (0)/M, λ∗ (0)M ] and dΣ (n) ≤ ρ(λ)1/6 n1+γ .

Then, there exists a constant C0 (depending uniquely on the constants in the assumptions), such
that for n ≥ C0 , the following holds with probability at least 1 − n−10 . We have

    B(λ; Z, β∗) = ( 1 + errB(n) ) · B(Σ, β∗) ,                                             (2.1.16)
    V(λ; Z) = ( 1 + errV(n) ) · V(Σ) ,                                                     (2.1.17)

where, under the proportional regime (scenario 1 above), we have |errB (n)| ≤ n−0.49 , |errV (n)| ≤
n−0.99 , while, in the dimension-free regime (scenario 2 above), |errB (n)| ≤ (dΣ (n)/n)3 /(ρ(λ)1/2 n0.99 ),
|errV (n)| ≤ (dΣ (n)/n)3 /n0.99 .

Remark 2.1.2. Defining FΣ (x) := n−1 Tr Σ(Σ + xI)−1 , Eq. (2.1.12) reads:


    1 − λ/λ∗ = FΣ(λ∗) .                                                                    (2.1.18)

For λ > 0, this equation always has a unique solution by monotonicity.
For λ = 0, the left-hand side is constant and equal to 1. The right-hand side is strictly monotone
decreasing with FΣ (0) = p/n and limx→∞ FΣ (x) = 0. Hence for p/n > 1 (overparametrized
regime), the equation has a unique solution λ∗ > 0.
For p/n ≤ 1 (underparametrized regime), we set λ∗ = 0 by definition.

Remark 2.1.3 (Ridgeless limit). By virtue of the previous remark, the predicted bias and variance
B(Σ, β∗) and V(Σ) make perfect sense for the case λ = 0. Indeed, Theorem 1 holds for λ = 0+
as well, although this requires an additional argument and possibly larger error terms.
Theorem 1 has a simple interpretation in terms of a simpler sequence model, which we next
define. In the sequence model we want to estimate β ∗ from observation y s given by
    y^s = Σ^{1/2} β∗ + (ω/√n) ε ,    ε ∼ N(0, Ip) .                                        (2.1.19)

We use ridge regression at regularization level λ∗ as defined in Eq. (2.1.12):

    β̂^s(λ∗) := argmin_{b∈Rp} { ∥ y^s − Σ^{1/2} b ∥₂² + λ∗ ∥b∥₂² } .                       (2.1.20)

Then the prediction for the risk of the original model, namely B(Σ, β∗) + V(Σ), coincides with
the risk of the sequence model, provided we choose ω to be the unique positive solution of

    ω² = τ² + Eε[ ∥ β̂^s(λ∗) − β∗ ∥²Σ ] .                                                  (2.1.21)

The solution of this equation is easy to express in terms of the quantities appearing in the theorem
statement:

    ω² = ( τ² + λ∗² ⟨β∗, (Σ + λ∗ I)⁻² Σ β∗⟩ ) / ( 1 − n⁻¹ Tr( Σ² (Σ + λ∗ I)⁻² ) ) .

To summarize, we have the following correspondence:

                       Gaussian feature model         Sequence model
    Design matrix      Random design matrix Z         Deterministic design Σ^{1/2}
    Ridge penalty      λ                              λ∗ > λ
    Noise variance     τ²                             ω² > τ²

Note in particular, as pointed out above (Remark 2.1.2), in the overparametrized regime we
have λ∗ > 0 even if λ = 0. We refer to this phenomenon as ‘self-induced regularization’ : the
noisy feature vectors in the original unregularized problem induce an effective regularization in the
equivalent sequence model.
Self-induced regularization is a key mechanism by which an interpolating model can generalize
well. Roughly speaking, although λ = 0+, the model behaves as if the sample covariance was
replaced by the population one, but the regularization was increased from 0 to λ∗ . The noise in
the covariates acts as a regularizer.
In order to make this correspondence even more concrete, we note that
    (1/n) Tr( Σ² (Σ + λ∗ I)⁻² ) < (1/n) Tr( Σ (Σ + λ∗ I)⁻¹ ) < 1 .                        (2.1.22)

Let us assume that this inequality holds uniformly. Namely, there exists a constant c1 < ∞ such
that

    (1/n) Tr( Σ² (Σ + λ∗ I)⁻² ) ≤ 1 − 1/c1 .                                               (2.1.23)

Substituting in Eqs. (2.1.14), (2.1.13), we get
    V(Σ) ≤ (c1 τ²/n) Tr( Σ² (Σ + λ∗ I)⁻² ) ,                                               (2.1.24)
    B(Σ, β∗) ≤ c1 λ∗² ⟨β∗, (Σ + λ∗ I)⁻² Σ β∗⟩ .                                            (2.1.25)

Notice that the expressions on the right-hand side (up to the constant c1 ) are the bias and variance
of the sequence model with ω² = τ². In other words, in some cases of interest, simply considering
the sequence model with ω² = τ² allows one to bound (up to constants) the risk of the original problem.

2.2 Non-Gaussian covariates


Theorem 1 holds beyond Gaussian covariates, and was proven under the following assumptions on
the covariates.
Let xi := Σ^{−1/2} zi, so that xi is isotropic, namely E{xi} = 0, E{xi xiT} = I. We then consider
the following two models for xi, depending on a constant κx > 0:

(a) Independent sub-Gaussian coordinates: xi has independent but not necessarily identically
distributed coordinates with uniformly bounded sub-Gaussian norm. Namely, each coordinate xij
of xi satisfies E[xij] = 0, Var(xij) = 1 and ∥xij∥ψ2 := sup_{q≥1} q^{−1/2} E{|xij|^q}^{1/q} ≤ κx.

(b) Convex concentration: allowing xi to have dependent coordinates, the following holds for
any 1-Lipschitz convex function φ : Rp → R, and for every t > 0:

    P( |φ(xi) − E φ(xi)| ≥ t ) ≤ 2 e^{−t²/κx²} .


For independent sub-Gaussian coordinates, a version of Theorem 1 was proven in [HMRT22],


although with larger error terms than stated here. The form stated here, the infinite-dimensional
case, with regularly varying spectrum, and the case of covariates satisfying convex concentration
were proven in [CM22].

2.3 An example: Analysis of a latent space model


As an application of the general theory in Section 2.1, it is instructive to consider the following
latent space model. We assume the response y to be linear in an underlying covariate vector x ∈ Rd .
However, we fit a model in the features z ∈ Rp , which are also linear in x:

yi = ⟨θ ∗ , xi ⟩ + ξi , ξi ∼ N(0, τ 2 ) , (2.3.1)
z i = W xi + ui , ui ∼ N(0, I p ) , (2.3.2)

where the matrix W is fixed (at random). We perform ridge regression of yi on z i as in Eq. (2.1.3)
(with Z the matrix whose i-th row is given by z i ).
We assume a simple model for the latent features, namely xi ∼ N(0, I d ), and W proportional
to an orthogonal matrix. Namely W T W = (pµ/d)I d . We consider the proportional asymptotics
p, n, d → ∞ with

    p/n → γ ∈ (0, ∞) ,    d/p → ψ ∈ (0, 1) ,                                               (2.3.3)

Figure 2.1: Test error of ridge regression under the latent space model. Left: circles are empirical results
and curves are theoretical predictions within the proportional asymptotics, in the ridgeless limit λ = 0+.
Here d = 20, τ = 0, rθ = 1, µ = 1. Different curves correspond to different sample sizes. Right: curves
correspond to different values of the regularization parameter. Here γψ = 1/20, τ = 0, rθ = 1, µ = 1.
Figure from [HMRT22].

and ∥θ ∗ ∥ → rθ . In particular, γ has the interpretation of number of parameters per sample, and
γ > 1 corresponds to the overparametrized regime.
Explicit formulas for the asymptotic risk can be obtained in this limit using Theorem 1. In
particular, in the minimum norm limit λ → 0, after a tedious but straightforward calculation, one
obtains
    Blat(ψ, γ) := ( 1 + γ c0 · E1(ψ, γ)/E2(ψ, γ) ) · ( µψ⁻¹ rθ² ) / ( (1 + µψ⁻¹)(1 + c0 γ (1 + µψ⁻¹))² ) ,   (2.3.4)
    Vlat(ψ, γ) := σ² γ c0 · E1(ψ, γ)/E2(ψ, γ) ,                                                              (2.3.5)
    E1(ψ, γ) := (1 − ψ)/(1 + c0 γ)² + ψ (1 + µψ⁻¹)²/(1 + c0 (1 + µψ⁻¹) γ)² ,                                 (2.3.6)
    E2(ψ, γ) := (1 − ψ)/(1 + c0 γ)² + ψ (1 + µψ⁻¹)/(1 + c0 (1 + µψ⁻¹) γ)² ,                                  (2.3.7)

where σ² = τ² + rθ²/(1 + µψ⁻¹), and c0 = c0(ψ, γ) ≥ 0 is the unique non-negative solution of the
following second order equation:

    1 − 1/γ = (1 − ψ)/(1 + c0 γ) + ψ/(1 + c0 (1 + µψ⁻¹) γ) .                                                 (2.3.8)
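The following sketch (illustrative only) solves Eq. (2.3.8) for c0 by bisection and evaluates (2.3.4)-(2.3.7); it assumes the overparametrized regime γ > 1, where the ridgeless fixed point has a positive solution, and all parameter values are placeholders.

```python
import numpy as np

# Illustrative sketch: ridgeless-limit formulas (2.3.4)-(2.3.8) for the latent space
# model, restricted to gamma > 1 so that (2.3.8) has a non-negative solution c0.
def latent_risk(gamma, psi, mu, r_theta, tau):
    a = 1.0 + mu / psi                                   # shorthand for (1 + mu * psi^{-1})
    g = lambda c: (1 - psi) / (1 + c * gamma) + psi / (1 + c * a * gamma) - (1 - 1 / gamma)
    lo, hi = 0.0, 1.0                                    # g is decreasing in c, g(0) = 1/gamma > 0
    while g(hi) > 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    c0 = 0.5 * (lo + hi)
    E1 = (1 - psi) / (1 + c0 * gamma) ** 2 + psi * a**2 / (1 + c0 * a * gamma) ** 2
    E2 = (1 - psi) / (1 + c0 * gamma) ** 2 + psi * a / (1 + c0 * a * gamma) ** 2
    sigma2 = tau**2 + r_theta**2 / a
    B = (1 + gamma * c0 * E1 / E2) * (mu / psi) * r_theta**2 / (a * (1 + c0 * gamma * a) ** 2)
    V = sigma2 * gamma * c0 * E1 / E2
    return B + V

for gamma in [1.1, 2.0, 5.0, 10.0, 20.0]:
    print(gamma, latent_risk(gamma, psi=0.05, mu=1.0, r_theta=1.0, tau=0.0))
```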

In Figure 2.1, left frame, we plot theoretical predictions and empirical results for the test error
of minimum norm interpolation, as a function of the overparametrization ratio. A few features of
this plot are striking:

1. A large spike is present at the interpolation threshold γ = 1: the test error grows when
   the model becomes overparametrized, but then descends again at large overparametrization.
   This behavior has been dubbed ‘double descent’ in [BHMM19] (see Appendix B for further
   discussions).

2. The minimum test error is achieved at large overparametrization γ ≫ 1.

3. In particular, overparametrized models behave better than underparametrized ones. Intuitively,
   the reason is that the latent space can be identified better when the dimension of zi grows.

In Figure 2.1, right frame, we plot theoretical predictions for the test error of ridge regression
as a function of the overparametrization ratio γ. The number of samples per latent dimension n/d
is kept fixed to (ψγ)−1 = 20. Different curves correspond to different values of λ. The test error
with optimal lambda is given by the lower envelope of these curves.
Two remarkable phenomena can be observed in these plots:

1. The spike at γ = 1 is smoothed for positive λ and disappears completely for the optimal
regularizations. In other words, the double descent is not an intrinsic statistical phenomenon,
and is instead due to under-regularization.

2. Nevertheless, at large overparametrization, the optimal regularization is λ = 0+.

The last point is particularly surprising: at large overparametrization (and large signal-to-noise
ratio) the optimal test error is achieved by an interpolator. The same phenomenon survives at
positive τ , and can be traced back to the fact that the effective regularization λ∗ is strictly positive
despite λ = 0+.

2.4 Bounds and benign overfitting


For a general Σ, β ∗ , the characterization given by Eqs. (2.1.12), (2.1.13), (2.1.14) may still be too
detailed to provide a simple intuition. In this section we derive upper bounds on bias and variance
that expose a particularly interesting phenomenon, which was first established in [BLLT20, TB20]
using a technically different approach: benign overfitting.
We work under the simplifying assumption

    Tr( Σ² (Σ + λ∗ I)⁻² ) ≤ n (1 − c∗⁻¹) ,                                                 (2.4.1)

for a constant c∗ ∈ (1, ∞). Let Σ = ∑_{i≥1} σi vi viT be the eigendecomposition of Σ, with σ1 ≥
σ2 ≥ · · · ≥ 0. Denote by β∗,≤k := ∑_{i≤k} ⟨β∗, vi⟩ vi the orthogonal projection of β∗ onto the span
of the top k eigenvectors v1, . . . , vk, and by β∗,>k := β∗ − β∗,≤k its complement. Finally, let
k∗ := max{k : σk ≥ λ∗}.
The above definitions are quite natural in view of the latent space model discussed in the
previous section. We want to separate the contribution of “signal” directions in covariate space
(corresponding to the span of (v i : i ≤ k∗ )), from the contribution due to “junk” covariates (the
projection onto the span of (v i : i > k∗ )). Technically, we bound all traces in Eqs. (2.1.12), (2.1.13),
(2.1.14) after splitting the contributions of these two subspaces.

Consider Eq. (2.1.12), which implies n ≥ Tr{Σ(Σ + λ∗ I)−1 }. We get
    n ≥ ∑_{ℓ=1}^{k∗} σℓ/(σℓ + λ∗) + ∑_{ℓ=k∗+1}^{p} σℓ/(σℓ + λ∗)
      ≥ ∑_{ℓ=1}^{k∗} σℓ/(2σℓ) + ∑_{ℓ=k∗+1}^{p} σℓ/(2λ∗)
      ≥ k∗/2 + r1(k∗)/(2 bk∗) ,

where we defined bk := σk/σ_{k+1} and

    rq(k) := ∑_{ℓ>k} ( σℓ/σ_{k+1} )^q .                                                    (2.4.2)

Also note that the reverse inequality n ≤ 2k∗ + 2 r1(k∗) can be proved along the same lines, as long
as λ ≤ λ∗/2.
In many situations of interest, the eigenvalues become more closely spaced as k → ∞, and therefore bk is
bounded by a constant. For instance, this is the case if σk ≍ k^{−α+ok(1)}. Further, rq(k) can be
regarded as a measure of the number of eigenvalues of the same order as σk+1 , and is therefore an
‘effective rank’ at level k.
Next we bound V(Σ). Recalling that Tr( Σ² (Σ + λ∗ I)⁻² ) ≤ n (1 − c∗⁻¹), we have

    V(Σ) ≤ (c∗ τ²/n) · ( ∑_{ℓ=1}^{k∗} σℓ²/(σℓ + λ∗)² + ∑_{ℓ=k∗+1}^{p} σℓ²/(σℓ + λ∗)² )
         ≤ (c∗ τ²/n) · ( k∗ + ∑_{ℓ=k∗+1}^{p} σℓ²/λ∗² )
         ≤ c∗ τ² ( k∗/n + r2(k∗)/n )
         ≤ c∗ τ² ( k∗/n + 4 bk∗² n / r(k∗) ) ,

where in the last step we defined

    r(k) := r1(k)² / r2(k) ,                                                               (2.4.3)

and used the bound (derived above) n ≥ r1(k∗)/(2 bk∗).
We proceed similarly for the bias, namely:
    B(Σ, β∗) ≤ c∗ ∑_{ℓ=1}^{p} ( λ∗² σℓ / (σℓ + λ∗)² ) ⟨β∗, vℓ⟩²
             ≤ c∗ ( ∑_{ℓ=1}^{k∗} λ∗² σℓ⁻¹ ⟨β∗, vℓ⟩² + ∑_{ℓ=k∗+1}^{p} σℓ ⟨β∗, vℓ⟩² )
             ≤ c∗ ( σk∗² ∥β∗,≤k∗∥²_{Σ⁻¹} + ∥β∗,>k∗∥²_Σ ) .

We summarize these bounds in the statement below.
Proposition 2.4.1. Define rq (k) via Eq. (2.4.2), r(k) via Eq. (2.4.3), k∗ := max{k : σk ≥ λ∗ },
and bk := σk /σk+1 .
Under condition (2.4.1), we have n ≥ k∗ /2+r1 (k∗ )/(2bk∗ ) and (for λ ≤ λ∗ /2) n ≤ 2k∗ +2r1 (k∗ ).
Further

    V(Σ) ≤ c∗ τ² ( k∗/n + r2(k∗)/n ) ≤ c∗ τ² ( k∗/n + 4 bk∗² n / r(k∗) ) ,                (2.4.4)
    B(Σ, β∗) ≤ c∗ ( σk∗² ∥β∗,≤k∗∥²_{Σ⁻¹} + ∥β∗,>k∗∥²_Σ ) .                                (2.4.5)
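As a quick numerical illustration (not from the notes), the sketch below computes k∗, the effective ranks r1, r2, r(k∗), and the first variance bound of (2.4.4) for a truncated spectrum; the value of λ∗, which in the theory is determined by Eq. (2.1.12), is here a placeholder, as are the spectra.

```python
import numpy as np

# Illustrative sketch: quantities of Proposition 2.4.1 for a truncated spectrum.
# lam_star is a placeholder (in the theory it solves Eq. (2.1.12)).
def benign_overfitting_quantities(sigma, n, lam_star, c_star=2.0, tau=1.0):
    k_star = int(np.sum(sigma >= lam_star))          # k_* = max{k : sigma_k >= lam_star}
    tail = sigma[k_star:]                            # sigma_{k_*+1}, sigma_{k_*+2}, ...
    r1 = np.sum(tail / tail[0])                      # r_1(k_*), Eq. (2.4.2) with q = 1
    r2 = np.sum((tail / tail[0]) ** 2)               # r_2(k_*)
    r = r1**2 / r2                                   # r(k_*), Eq. (2.4.3)
    V_bound = c_star * tau**2 * (k_star / n + r2 / n)        # first bound in (2.4.4)
    return k_star, r1, r, V_bound

p, n = 10**6, 10**3
k = np.arange(1, p + 1)

# polynomial decay sigma_k ~ k^{-2}
print(benign_overfitting_quantities(1.0 / k**2, n, lam_star=1e-4))
# slower decay sigma_k ~ k^{-1} (log k)^{-2} (the benign case discussed in the text below)
print(benign_overfitting_quantities(1.0 / (k * np.log(k + 1) ** 2), n, lam_star=1e-4))
```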

As mentioned above, bounds of this form¹ were first proven in [BLLT20, TB20]. These bounds
have an interesting consequence. They allow us to characterize pairs Σ, β ∗ (with p = ∞) for which
min-norm interpolation (the λ = 0+ limit of ridge regression) is ‘consistent’ even if τ > 0. In other
words, two things are happening at the same time:
1. The fitted model perfectly interpolates the train data (the train error vanishes).

2. The excess test error vanishes as n → ∞. (In statistics language, the model is consistent.)

When these two elements occur together, we speak of benign overfitting.
Proposition 2.4.1 allows us to determine sufficient conditions for benign overfitting. We will assume
for simplicity σk → 0 as k → ∞ and bk bounded.
Note that the condition n ≤ 2k∗ + 2 r1(k∗) implies k∗ → ∞ as n → ∞. Hence, for the bias
to vanish it is sufficient that ∥β∗∥_{Σ⁻¹} < ∞. Indeed, this implies B(Σ, β∗) ≤ c∗ ( σk∗² ∥β∗∥²_{Σ⁻¹} + σ²_{k∗+1} ∥β∗∥²_{Σ⁻¹} ) → 0. Summarizing:

    ∥β∗∥²_{Σ⁻¹} < ∞  ⇒  B(Σ, β∗) → 0 .                                                    (2.4.6)

For instance, if ⟨β∗, vk⟩ ≠ 0 only for finitely many k, then this condition is obviously satisfied. More
generally, it conveys the intuition that β ∗ should be mostly aligned with the top eigendirections of
Σ. The number of eigendirections one should take into account diverges with n.
Next consider the variance V (Σ). Clearly, in order for the bound in Eq. (2.4.4) to vanish, the
following conditions must be satisfied:
    k∗/n → 0 ,    r(k∗)/n → ∞ .                                                            (2.4.7)

In order to get some intuition, let us start by considering the case of polynomially decaying eigenvalues
σk ≍ k^{−α}. We need to take α > 1 to ensure ∥zi∥2 < ∞ almost surely. We then have, for q ≥ 1,

    rq(k) ≍ ∑_{ℓ>k} (k/ℓ)^{qα} ≍ k ,

whence

k∗ (n) ≍ n , r(k∗ ) ≍ n .
¹ There are some technical differences between the present statement and the results of [BLLT20, TB20]. We refer
to [CM22] for a discussion of the differences.

Hence the conditions (2.4.7) for the vanishing of variance do not hold in this case.
We insist, and consider a more slowly decaying sequence of eigenvalues σk ≍ k −1 (log k)−β ,
β > 1. In this case
    rq(k) ≍ k^q (log k)^{qβ} ∑_{ℓ>k} 1/( ℓ^q (log ℓ)^{qβ} )
          ≍ k^q (log k)^{qβ} ∫_k^∞ dx/( x^q (log x)^{qβ} )
          ≍ k^q (log k)^{qβ} ∫_{log k}^∞ e^{−(q−1)t}/t^{qβ} dt .

Therefore

    rq(k) ≍ k log k  if q = 1 ,    rq(k) ≍ k  if q > 1 ,

whence r(k) ≍ k (log k)². Using the bounds for k∗(n) in Proposition 2.4.1, we conclude that

    k∗(n) ≍ n/log n ,    r(k∗) ≍ n log n .

The conditions of Eq. (2.4.7) are therefore satisfied in this case.

Chapter 3

Kernel ridge regression in high dimension

In this chapter we consider linearized neural networks in the infinite-width limit N → ∞. There
are two reasons for beginning our analysis from this limit case:
1. In the N → ∞ limit, the ridge regression problem (1.4.9) simplifies to kernel ridge regression
(KRR) with respect to an inner-product kernel. On one hand, this is a simpler problem than
the original one. On the other hand, KRR is an interesting method for its own sake.
2. As we will see in the next two chapters, mildly overparametrized networks in the linear regime
behave similarly to their N = ∞ limit.
The specific kernel arising by taking the wide limit of neural networks is commonly referred to as
neural tangent kernel (NTK) [JGH18]. For the sake of clarity we will refer to it as the infinite width
NTK. In the case of fully connected networks (for any constant depth), there is hardly anything
special about the NTK. As we will see, the analysis can be carried out in a unified fashion for any
inner product kernel and the behavior is qualitatively independent of the specific kernel under some
genericity assumptions.
The goal of this chapter is to obtain a tight characterization of the test error of KRR in high
dimension.

3.1 Infinite-width limit and kernel ridge regression


Recall from Section 1.4 that we are interested in the ridge regression estimator
    b̂(λ) = argmin_{b∈Rp} { ∥ ỹ − Φb ∥₂² + λ ∥b∥₂² } ,                                    (3.1.1)

where Φ = ( ϕ(x1)T, . . . , ϕ(xn)T )T ∈ R^{n×p} is the matrix containing the feature vectors, and ϕ :
Rd → Rp is a featurization map. We are particularly interested in the featurization map obtained by
linearizing a two-layer neural network, as in Eq. (1.4.6) (in this case p = N(d + 1)). For simplicity,
we will assume f(x; θ0) = 0 and replace ỹ by y (cf. Remark 1.4.2).
The minimizer of Problem (3.1.1) can be written explicitly as

    b̂(λ) = ( λ Ip + ΦT Φ )⁻¹ ΦT y = ΦT ( λ In + Φ ΦT )⁻¹ y .                              (3.1.2)

The prediction function is given by
    fˆλ(x) = ⟨b̂(λ), ϕ(x)⟩ = ϕ(x)T ΦT ( λ In + Φ ΦT )⁻¹ y .                                (3.1.3)

Let us introduce the kernel function K : Rd × Rd → R defined by K(x1, x2) = ⟨ϕ(x1), ϕ(x2)⟩,
where ⟨u, v⟩ = u1 v1 + . . . + up vp denotes the standard Euclidean inner product on Rp. We further
define Kn = (K(xi, xj))1≤i,j≤n = Φ ΦT ∈ Rn×n, the empirical kernel matrix evaluated at the n
data points. We can rewrite the prediction function (3.1.3) as

    fˆλ(x) = K(x, ·)T ( λ In + Kn )⁻¹ y ,                                                  (3.1.4)

where we denoted K(x, ·) = ( K(x, x1), . . . , K(x, xn) )T ∈ Rn. Equivalently, the prediction func-
tion (3.1.4) corresponds to the kernel ridge regression estimator with kernel K and regularization
parameter λ (see Remark 3.1.2 below). In other words, performing ridge regression on a linear
model with featurization map ϕ is equivalent to performing kernel ridge regression with kernel
K( · , · ) = ⟨ϕ( · ), ϕ( · )⟩.
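As an illustration of Eq. (3.1.4) (not from the notes), the sketch below performs kernel ridge regression with an inner-product kernel on synthetic spherical data; the kernel function h, the target f⋆, and the regularization are placeholder choices.

```python
import numpy as np

# Illustrative sketch of kernel ridge regression, Eq. (3.1.4), with an inner-product
# kernel K(x1, x2) = h(⟨x1, x2⟩/d) on data normalized to the sphere of radius sqrt(d).
rng = np.random.default_rng(4)
n, d, lam = 500, 50, 1e-3

X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
f_star = lambda x: x[:, 0] * x[:, 1] + 0.5 * x[:, 2]          # placeholder target
y = f_star(X) + 0.1 * rng.standard_normal(n)

h = lambda t: np.exp(t) - 1.0                                 # a generic dot-product kernel function
K_n = h(X @ X.T / d)                                          # empirical kernel matrix
alpha = np.linalg.solve(K_n + lam * np.eye(n), y)             # (lam I_n + K_n)^{-1} y

def f_hat(X_new):
    return h(X_new @ X.T / d) @ alpha                         # K(x, .)^T (lam I_n + K_n)^{-1} y

X_test = rng.standard_normal((1000, d))
X_test *= np.sqrt(d) / np.linalg.norm(X_test, axis=1, keepdims=True)
print("KRR excess test error estimate:", np.mean((f_star(X_test) - f_hat(X_test)) ** 2))
```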

Remark 3.1.1 (Reproducing kernel Hilbert space). In general, consider a feature map ϕ : Rd → F
where F is a Hilbert space, often called the ‘feature space’, with inner product ⟨ · , · ⟩F and norm
∥ · ∥F = ⟨ · , · ⟩F^{1/2}. Introduce the function space

H := {h(·) = ⟨θ, ϕ( · )⟩F : θ ∈ F, ∥θ∥F < ∞} . (3.1.5)

Then H is a reproducing kernel Hilbert space (RKHS) with reproducing kernel given by K(x1 , x2 ) =
⟨ϕ(x1 ), ϕ(x2 )⟩F and RKHS norm

∥h∥H = inf{∥θ∥F : θ ∈ F, h( · ) = ⟨θ, ϕ( · )⟩F } . (3.1.6)

(Conversely, any RKHS with reproducing kernel K induces a featurization map, e.g. taking F = H
and ϕ(x) = K(x, · ).)
In the simple case of Eq. (3.1.1), ϕ : Rd → Rp , F = Rp and the RKHS is simply the finite-
dimensional set of linear functions h(x) = ⟨b, ϕ(x)⟩ with b ∈ Rp and ∥b∥2 < ∞. However, in
general, we can take F to be infinite-dimensional. The RKHS framework is particularly useful be-
cause of this flexibility (see next remark). We refer the reader to [BTA11] for a general introduction
to the theory of RKHS and kernel methods.

Remark 3.1.2 (Kernel ridge regression). Kernel ridge regression is a general approach to learning
that abstracts the specific examples of ridge regression that we studied so far. In a first step, we
map the data x 7→ ϕ(x) into a feature space (F, ⟨ · , · ⟩F ). We then fit a low-norm linear predictor
with respect to this embedding, i.e. fˆλ (x) = ⟨θ̂(λ), ϕ(x)⟩F where
    θ̂(λ) := argmin_{θ∈F} { ∑_{i=1}^n ( yi − ⟨θ, ϕ(xi)⟩F )² + λ ∥θ∥²F } .

From Remark 3.1.1, this is equivalent to the following:


    fˆλ = argmin_{f∈H} { ∑_{i=1}^n ( yi − f(xi) )² + λ ∥f∥²H } ,                           (3.1.7)

where H is the RKHS associated to the feature map ϕ. By the Representer Theorem, the solution
(3.1.7) is given explicitly by fˆλ (x) = â1 (λ)K(x, x1 )+. . .+ân (λ)K(x, xn ) where K is the reproducing
kernel of H and
    â(λ) = argmin_{a∈Rn} { ∥ y − Ka ∥₂² + λ aT K a } = (λ In + K)⁻¹ y ,

with K = (K(xi , xj ))i,j∈[n] the kernel matrix. This is indeed the solution found in Eq. (3.1.4).
Note the following important observation: we do not need to evaluate the (potentially infinite-
dimensional) feature maps ϕ(xi ) but only their inner-products K(xi , xj ) = ⟨ϕ(xi ), ϕ(xj )⟩F which
can often be done efficiently. This is known as the kernel trick.

In our case, the feature map is induced by the linearization of a two-layer neural network,
ϕ(x) = ( ϕRF(x), ϕNT(x) ) ∈ R^{N(d+1)}, where we recall that

    ϕRF(x) = (1/√N) [σ(⟨w1, x⟩); . . . ; σ(⟨wN, x⟩)] ,                                     (3.1.8)
    ϕNT(x) = (1/√(Nd)) [σ′(⟨w1, x⟩) xT; . . . ; σ′(⟨wN, x⟩) xT] .                          (3.1.9)
The associated kernel is given by
    KN(x1, x2) = ⟨ϕ(x1), ϕ(x2)⟩ = KN^RF(x1, x2) + KN^NT(x1, x2) ,                          (3.1.10)

where

    KN^RF(x1, x2) := ⟨ϕRF(x1), ϕRF(x2)⟩ = (1/N) ∑_{i=1}^N σ(⟨wi, x1⟩) σ(⟨wi, x2⟩) ,        (3.1.11)
    KN^NT(x1, x2) := ⟨ϕNT(x1), ϕNT(x2)⟩ = (⟨x1, x2⟩/(Nd)) ∑_{i=1}^N σ′(⟨wi, x1⟩) σ′(⟨wi, x2⟩) .   (3.1.12)

These kernels are random because of the random weights wi , and are finite-dimensional, with rank
at most N and N d respectively.
We draw w1 , . . . , wN i.i.d. from a common distribution ν on Rd . As the number of neurons
goes to infinity, both kernels converge pointwise to their expectations by the law of large numbers:

    KN^RF(x1, x2) → K^RF(x1, x2) ,    KN^NT(x1, x2) → K^NT(x1, x2) ,                       (3.1.13)

where (with w ∼ ν)

    K^RF(x1, x2) = Ew[ σ(⟨w, x1⟩) σ(⟨w, x2⟩) ] ,                                           (3.1.14)
    K^NT(x1, x2) = (⟨x1, x2⟩/d) Ew[ σ′(⟨w, x1⟩) σ′(⟨w, x2⟩) ] .                             (3.1.15)
d
Let us emphasize that the pointwise convergence of Eqs. (3.1.13) does not provide any quantitative
control on how large N has to be for finite-N ridge regression to behave similarly to N = ∞ ridge
regression. This question will be addressed in the next two chapters.
We will consider either w ∼ Unif(Sd−1 ) (with Sd−1 = {w ∈ Rd : ∥w∥2 = 1} the unit sphere
in d dimensions) or w ∼ N(0, I d /d). These correspond to standard initializations used in neural

networks and both are very close to each other for d ≫ 1. Since the distribution of w is invariant
under rotations in Rd , the kernels K RF and K NT are also invariant under rotations and can therefore
be written as functions of ∥x1∥2, ∥x2∥2 and ⟨x1, x2⟩.
We will assume hereafter that the data is normalized with ∥x∥2 = √d. Then we can write

    K^RF(x1, x2) = h^(d)_RF( ⟨x1, x2⟩/d ) ,    K^NT(x1, x2) = h^(d)_NT( ⟨x1, x2⟩/d ) .     (3.1.16)
Kernels that only depend on the inner product of their inputs are also called inner-product or
dot-product kernels.

Remark 3.1.3. Let us comment on the normalization choice in these notes. First, for ∥x∥2 = √d
and wi ∼ Unif(Sd−1) or N(0, Id/d), the input ⟨wi, x⟩ to the non-linearity σ has variance of order 1.
This is the correct scale for the activation function to behave in a nontrivial manner. Second, we
allow the kernels h^(d)_RF and h^(d)_NT to depend on the dimension, and hence the scaling of the argument
⟨x1, x2⟩/d is somewhat arbitrary. However, the specific choice of Eq. (3.1.16) is motivated by the
fact that, with this scaling, the functions h^(d)_RF, h^(d)_NT converge to a well defined limit as d → ∞. More
precisely:

1) For w ∼ N(0, Id/d), h^(d)_RF and h^(d)_NT are independent of d and are given by (with (G1, G2) ∼
   N(0, I2); a Monte Carlo sketch follows after this remark)

       hRF(t) = E_{G1,G2}{ σ(G1) σ( tG1 + √(1 − t²) G2 ) } ,                               (3.1.17)
       hNT(t) = t E_{G1,G2}{ σ′(G1) σ′( tG1 + √(1 − t²) G2 ) } ,                           (3.1.18)

   where the expectation is taken with respect to (G1, G2) ∼ N(0, I2).

2) For w ∼ Unif(Sd−1), h^(d)_RF and h^(d)_NT are nearly independent of d. Namely, because of the
   concentration of the norm of Gaussian random vectors, we have h^(d)_RF → hRF and h^(d)_NT → hNT
   as d → ∞, the same kernels as for isotropic Gaussian weights.
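The expectations (3.1.17)-(3.1.18) are easy to approximate numerically; the following Monte Carlo sketch (illustrative, with σ = tanh as a placeholder activation) estimates hRF(t) and hNT(t).

```python
import numpy as np

# Monte Carlo sketch of the limiting kernel functions h_RF and h_NT, Eqs. (3.1.17)-(3.1.18).
rng = np.random.default_rng(5)
G1, G2 = rng.standard_normal((2, 10**6))

sigma = np.tanh
dsigma = lambda u: 1.0 - np.tanh(u) ** 2

def h_RF(t):
    return np.mean(sigma(G1) * sigma(t * G1 + np.sqrt(1.0 - t**2) * G2))

def h_NT(t):
    return t * np.mean(dsigma(G1) * dsigma(t * G1 + np.sqrt(1.0 - t**2) * G2))

for t in [0.0, 0.1, 0.5, 1.0]:
    print(t, h_RF(t), h_NT(t))
```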
Remark 3.1.4. More generally, consider a multilayer fully-connected neural network defined as

    f(x; θ) := W_L σ( W_{L−1} σ( W_{L−2} · · · σ(W_1 x) · · · ) ) ,                        (3.1.19)

where the parameters are θ = (W_1, . . . , W_L) with W_L ∈ R^{1×dL}, W_ℓ ∈ R^{d_{ℓ+1}×d_ℓ} and d1 = d. If
(W_ℓ)ij ∼iid N(0, τℓ²), then the expected kernel

    K^{NT,L}(x1, x2) := Eθ{ ⟨∇θ f(x1; θ), ∇θ f(x2; θ)⟩ } ,                                 (3.1.20)

is rotationally invariant. In particular, if the inputs are normalized, ∥xi∥2 = √d, then the kernel
must take the form

    K^{NT,L}(x1, x2) = h^(d)_{NT,L}( ⟨x1, x2⟩/d ) .                                        (3.1.21)

Similarly to the two-layer case, if we scale the parameters (W_ℓ)ij ∼iid N(0, τℓ²/dℓ) and take
min_{ℓ=1,...,L} dℓ → ∞, we have pointwise convergence

    K_N^{NT,L}(x1, x2) := ⟨∇θ f(x1; θ), ∇θ f(x2; θ)⟩ → K^{NT,L}(x1, x2) .                  (3.1.22)
This limiting kernel is referred to as the neural tangent kernel [JGH18].

3.2 Test error of KRR in the polynomial high-dimensional regime
In the previous section, we saw that, under a suitable distribution of the weights, ridge regression
with a large enough network size can be approximated by kernel ridge regression with an inner-
product kernel. In the following, we characterize the risk of KRR for general inner-product kernels
(in particular the neural tangent kernel with any number of layers, by Remark 3.1.4). We defer
to Chapters 4 and 5 the important question of how many neurons are needed for finite-width
networks to have a risk similar to that of their infinite-width limits.
Throughout this section we consider an isotropic model for the distribution of the covariates
xi ∈ Rd . We assume to be given i.i.d. samples {(yi , xi )}i≤n with
    yi = f⋆(xi) + εi ,    xi ∼ Unif( Sd−1(√d) ) ,                                          (3.2.1)

where f⋆ ∈ L2 := L2(Sd−1(√d)) is a general square-integrable function on the sphere of radius √d,
i.e. E{f⋆(x)²} < ∞, and the noise εi is independent of xi, with E{εi} = 0, E{εi²} = τ².
We will consider a general rotationally invariant kernel K(x1 , x2 ) = h(⟨x1 , x2 ⟩/d), where we
take h : [−1, +1] → R independent of the dimension. We further assume h to be ‘generic’, meaning
that in the basis of Hermite polynomials {Hek }k≥0 ,

    h(t/√d) = ∑_{k=0}^∞ cd,k Hek(t) ,    cd,k := (1/k!) E_{G∼N(0,1)}[ Hek(G) h(G/√d) ] ,   (3.2.2)

has all its Hermite coefficients¹ satisfying d^{k/2} cd,k → ck > 0 as d → ∞ for all k ≥ 0. This
corresponds to a universality condition: if ck = 0, then the KRR estimator will not fit degree-k
spherical components of the target function, no matter the number of samples.
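The coefficients cd,k in (3.2.2) can be computed numerically; the sketch below (illustrative; the kernel function h is a placeholder) uses Gauss-Hermite quadrature together with the three-term recurrence for the probabilists' Hermite polynomials, and prints d^{k/2} cd,k.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss

# Illustrative sketch: Hermite coefficients c_{d,k} of Eq. (3.2.2) by quadrature.
def hermite_coeffs(h, d, k_max, n_nodes=100):
    t, w = hermegauss(n_nodes)              # nodes/weights for the weight exp(-t^2/2)
    w = w / np.sqrt(2.0 * np.pi)            # normalize so that sum(w * g(t)) ~ E[g(G)], G ~ N(0,1)
    vals = h(t / np.sqrt(d))
    He_prev, He = np.zeros_like(t), np.ones_like(t)               # He_{-1} = 0, He_0 = 1
    coeffs = []
    for k in range(k_max + 1):
        coeffs.append(np.sum(w * He * vals) / factorial(k))       # c_{d,k} = E[He_k(G) h(G/sqrt(d))]/k!
        He_prev, He = He, t * He - k * He_prev                    # He_{k+1}(t) = t He_k(t) - k He_{k-1}(t)
    return np.array(coeffs)

d, k_max = 50, 4
c = hermite_coeffs(lambda u: np.exp(u), d, k_max)                 # placeholder 'generic' kernel
print(c * d ** (np.arange(k_max + 1) / 2.0))                      # d^{k/2} c_{d,k}, approaching c_k = 1/k!
```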
Several of the assumptions above have been partially relaxed in the literature, and we will
provide pointers below. However, for clarity of exposition, we consider the simplest possible setting.
We are interested in the KRR estimator fˆλ with kernel h as described above. Recall that fˆλ is
given by (see Eq. (3.1.4))
    fˆλ(x) = K(x, ·)T ( λ In + Kn )⁻¹ y .                                                  (3.2.3)

We are interested in the excess test error under square loss, which we denote

    RKRR(f⋆; X, y, λ) := Ex{ ( f⋆(x) − fˆλ(x) )² } .                                       (3.2.4)

The next theorem characterizes the risk of KRR up to a vanishing constant in the high-
dimensional polynomial regime. For ℓ ∈ N, we denote by P≤ℓ : L2 → L2 the orthogonal projector
onto the subspace of polynomials of degree at most ℓ, P>ℓ := I − P≤ℓ and Pℓ := P≤ℓ P>ℓ−1 . Further,
od,P will denote the standard little-o in probability: h1 (d) = od,P (h2 (d)) if h1 (d)/h2 (d) converges to
zero in probability.

Theorem 2. Assume the data {(yi, xi)}i≤n are distributed according to model (3.2.1) and the
kernel function h satisfies the genericity condition (3.2.2).

¹ The normalization is not important, but to be definite, we choose the standard one E[Hej(G) Hek(G)] = k! 1_{j=k} (G ∼ N(0, 1)).


Figure 3.1: A cartoon illustration of the test error of KRR in the polynomial high-dimensional scaling
n/dκ → ψ, as n, d → ∞, for any κ, ψ ∈ R>0 . The test error follows a staircase, with peaks that can occur
at each κ = ℓ ∈ N, depending on the effective regularization ζℓ and effective signal-to-noise ratio SNRℓ at
that scale.

(a) If d^{ℓ+δ} ≤ n ≤ d^{ℓ+1−δ} for some integer ℓ and constant δ > 0, then there exists a constant
    λ⋆ = Θd(1) such that, for any regularization parameter λ ∈ [0, λ⋆], we have (cf. Eq. (3.2.4))

        RKRR(f⋆; X, y, λ) = ∥P>ℓ f⋆∥²_{L2} + od,P(1) · ( ∥f⋆∥²_{L2} + τ² ) .               (3.2.5)

Furthermore, no kernel method with dot-product kernel can do better (i.e. have smaller risk).

(b) If n/(dℓ/ℓ!) → ψ for some integer ℓ and constant ψ > 0, then, denoting by ζℓ = (λ + c>ℓ)/cℓ the
    effective regularization at level ℓ, with c>ℓ = ∑_{k>ℓ} ck, we have

        RKRR(f⋆; X, y, λ) = ∥Pℓ f⋆∥²_{L2} · B(ψ, ζℓ) + ( ∥P>ℓ f⋆∥²_{L2} + τ² ) · V(ψ, ζℓ) + ∥P>ℓ f⋆∥²_{L2}
                            + od,P(1) · ( ∥f⋆∥²_{L2} + τ² ) ,                              (3.2.6)


where the definitions of B(ψ, ζℓ ) and V(ψ, ζℓ ) can be found in [Mis22].

Remark 3.2.1. Part (a) of the above theorem was proven in [GMMM21] and was later generalized
to other RKHS in [MMM22] under a ‘spectral gap’ assumption. Part (b) was proven in [Mis22,
HL22a, XHM+ 22].


Figure 3.2: A cartoon illustration of the benign overfitting phenomenon in kernel ridgeless regression. This
cartoon is supposed to capture the qualitative behavior of kernel min-norm interpolators in high dimension.
The interpolator decomposes into the sum of a smooth part, which captures the best low-degree approximation
of the target, and a spiky part that interpolates the noisy data.

In words, Eq. (3.2.5) implies that fˆλ only fits the projection of f⋆ onto low degree polynomials:
if dℓ ≪ n ≪ dℓ+1, KRR learns the best degree-ℓ polynomial approximation to f⋆ and nothing else.
Equivalently, we can decompose

fˆλ = P≤ℓ f⋆ + ∆ , (3.2.7)

where ∥∆∥L2 = (∥f⋆ ∥L2 +τ 2 )·od,P (1) does not contribute to the test error. For example, this theorem
implies that if n ≤ d^{1.99}, KRR can only fit the linear component of f⋆. Each time log(n)/log(d)
crosses an integer value, KRR learns one more degree of polynomial approximation: the risk presents
a staircase decay as n increases, with peaks that can occur at n = dℓ/ℓ!. This phenomenon is
illustrated with a cartoon in Figure 3.1.
As mentioned above, the conclusions of Theorem 2 apply to any rotationally invariant kernel
(under the genericity assumption). Notably, they apply to the neural tangent kernel of fully-
connected networks of any depth (as explained in Remark 3.1.4). In the linear regime and in high
dimension, depth does not appear to play an important role for fully-connected neural networks.

Remark 3.2.2 (Benign Overfitting). As discussed in Section 1.3, one particularly interesting so-
lution is the case λ = 0+:
    fˆ0 := argmin_{f∈H} { ∥f∥H : f(xi) = yi , ∀i ∈ [n] } ,                                 (3.2.8)

which corresponds to the minimum-norm interpolating solution. In this case, the KRR solution
perfectly interpolates the noisy data yi = f⋆ (xi ) + εi . Problem (3.2.8) is sometimes referred to as
kernel ‘ridgeless’ regression following [LR20].

Theorem 2 establishes that KRR with λ = 0+ is optimal among all kernel methods when
dℓ ≪ n ≪ dℓ+1 , in the sense that it achieves the best possible test error. This is another example
of benign overfitting where interpolation does not harm generalization, as described in Section 2.4.
Recalling Eq. (3.2.7), we can decompose fˆ0 = P≤ℓ f⋆ + ∆ where
1) P≤ℓ f⋆ is a smooth component good for prediction;
2) ∆ is a spiky component useful for interpolation, ∆(xi ) = P>ℓ f⋆ (xi ) + εi , but which does not
contribute to the test error, since ∥∆∥L2 ≪ 1.
We illustrate benign overfitting in kernel regression with a cartoon in Figure 3.2.
Remark 3.2.3 (Multiple peaks in the risk curve). For n ≍ dℓ , KRR transitions from not fitting the
degree-ℓ polynomial components at all for n ≪ dℓ to fitting all degree-ℓ polynomials when n ≫ dℓ .
The behavior between these two regimes, for n ≍ dℓ , is more complex. Because of the degeneracy of
the kernel operator eigenvalues at that scale, the spectrum of the kernel matrix follows a shifted and
rescaled Marchenko-Pastur distribution. This can lead to a peak in the risk curve, whenever the
effective regularization ζℓ or the effective signal-to-noise ratio SNRℓ := ∥Pℓ f∗ ∥2L2 /(∥P>ℓ f∗ ∥2L2 + τ 2 )
are small enough. We refer the interested reader to [Mis22] for more details.

3.3 Diagonalization of inner-product kernels on the sphere


In the next section we will present some ideas of the proof of Theorem√2. Before doing that, it
is useful to provide some background on the functional space L2 (Sd−1 ( d)) of square-integrable
functions on the sphere. For a more in-depth presentation, we refer to [Sze39, Chi11, EF14].
We use the orthogonal decomposition
√ ∞
M
L2 (Sd−1 ( d)) = Vd,k , (3.3.1)
k=0
Lℓ−1
where Vd,ℓ is the space of degree-ℓ polynomials orthogonal to k=0 Vd,k (orthogonality is with
respect to the L2 inner-product ⟨f, g⟩L2 = E{f (x)g(x)}). Equivalently, Vd,k is the space of all
spherical harmonics of degree k, and has dimension
 
2k + d − 2 k + d − 3
dim(Vd,k ) = B(d, k) := . (3.3.2)
k k−1
Note that B(d, k) = (dk /k!)(1+od (1)). Consider {Yks }s≤B(d,k) an orthonormal basis of Vd,k . In par-
ticular, we have ⟨Yku , Yℓv ⟩L2 = δkℓ δuv , and the set {Yks }k≥0,s≤B(d,k) forms a complete orthonormal

basis of L2 (Sd−1 ( d)). Lℓ
It will be convenient to introduce the following notations: Vd,≤ℓ = L k=0 Vd,k the space of all
polynomials of degree at most ℓ, and its orthogonal complement Vd,>ℓ = ∞ k=ℓ+1 Vd,k . Further we
denote by Pk the orthogonal projection on Vd,k , by P≤ℓ = P0 + . . . + Pℓ the orthogonal projection
on Vd,≤ℓ , and P>ℓ = I − P≤ℓ the orthogonal projection on Vd,>ℓ .

d−1 ( d), i.e. a positive semidefi-
Consider now a rotationally
√ invariant
√ kernel defined on S
nite kernel2 K : Sd−1 ( d) × Sd−1 ( d) → R such that K(x1 , x2 ) = h(⟨x1 , x2 ⟩/d) for some mea-
2 P
This means a measurable function such that i,j≤m K(xi , xj )αi αj ≥ 0 for any m and any collection of points
(xi )i≤m , and weights (αi )i≤m .

26
surable function
√ h : [−1, 1] →√ R. To the kernel function K, we associate its integral operator
2 d−1 2 d−1
K : L (S ( d)) → L (S ( d)) defined by

Kf (x1 ) = Ex2 ∼Unif(Sd−1 (√d)) h(⟨x1 , x2 ⟩/d)f (x2 ) . (3.3.3)

The proof of Theorem 2 relies on the eigendecomposition of inner-product kernels on the sphere. We
know that by the spectral theorem for compact operators, the kernel function K can be diagonalized
as

X
K(x1 , x2 ) = λ2j ψj (x1 )ψj (x2 ) , (3.3.4)
j=1

where {ψj }j≥1 is an orthonormal basis of L2 (Sd−1 ( d)), and the {λ2j }j≥1 are the eigenvalues in
nonincreasing order λ21 ≥ λ22 ≥ λ23 ≥ · · · ≥ 0.
By rotational invariance, the subspaces Vd,k are eigenspaces of the kernel operator K, i.e. we
can write

X
2
K= ξd,k Pk , (3.3.5)
k=0

2 the eigenvalue associated to the eigenspace V .


where we now denote ξd,k d,k
Since {Yks : s ≤ B(d, k)} form an orthonormal basis of Vd,k , we have

B(d,k)
X
Pk f = Yks ⟨Yks , f ⟩L2 . (3.3.6)
s=1

Equivalently, the projector is represented as an integral operator


(d)
Pk f (x1 ) = B(d, k) · Ex2 {Qk (⟨x1 , x2 ⟩)f (x2 )}, , (3.3.7)
B(d,k)
(d) 1 X
Qk (⟨x1 , x2 ⟩) := Yks (x1 )Yks (x2 ) , (3.3.8)
B(d, k)
s=1

(d)
Note that Qk ( · ) must be a function of the inner product ⟨x1 , x2 ⟩ by rotational invariance, and
must be a polynomial of degree k because so are the {Yks }. Indeed, this is a special function known
(d)
as a Gegenbauer polynomial. The√ polynomials {Qk }k≥0 form
√ an orthogonal basis of L2 ([−d, d], νd )
where νd is the distribution of dx1 with x ∼ Unif(Sd−1 ( d)).
The following properties of Gegenbauer polynomials follow from the above representation:
(d) (d) PB(d,k)
1. Qk (d) = Ex {Qk (⟨x, x⟩)} = B(d, k)−1 s=1 E{Yks (x1 )2 }. Therefore
(d)
Qk (d) = 1 . (3.3.9)

2. Since projectors are idempotent (P2k = Pk ), it follows that

(d) (d) 1 (d)


Ex {Qk (⟨x1 , x⟩)Qk (⟨x, x2 ⟩) = Q (⟨x1 , x2 ⟩) . (3.3.10)
B(d, k) k

27
3. In particular
1
(d) (d)
⟨Qj , Qk ⟩L2 (τd ) = 1j=k . (3.3.11)
B(d, k)

Representing the projection operator in Eq. (3.3.5), we get the following diagonalization of
inner-product kernels,
∞ B(d,k)
X X
2
K(x1 , x2 ) = ξd,k Yks (x1 )Yks (x2 ) . (3.3.12)
k=0 s=1

In particular, this means that


2
n √ √ o
ξd,k = Ex Qdk ( dx1 )h(x1 / d) . (3.3.13)

(d)
N(0, 1) and therefore B(d, k)1/2 Qk converges
When d → ∞, the measure νd converges weakly to √
to the degree-k normalized Hermite polynomial Hek / k!. The ‘genericity condition’ (3.2.2) can be
2 B(d, k) → c > 0 as d → ∞.
restated as ξd,k k

3.4 Proof sketch


In this section, we outline the proof of Eq. (3.2.5) in Theorem 2.(a). The proof crucially depends
on controlling the empirical kernel matrix K n = (K(xi , xj ))1≤ij≤n . While this is a non-linear
(in the data) random matrix, which is usually hard to study, K n simplifies in our polynomial
high-dimensional regime. We explain below how K n can be approximately decomposed into a low-
rank matrix (coming from the low-degree polynomials) plus a matrix proportional to the identity
(coming from the high-degree non-linear part of the kernel).
Using the diagonalization of inner-product kernels in Eq. (3.3.12), we can decompose the kernel
matrix into a low-frequency and a high-frequency component,

X X
2 ≤ℓ
Kn = ξd,k Y kY T
k +
2
ξd,k Y kY T >ℓ
k =: K n + K n , (3.4.1)
k=0 k≥ℓ+1

where Y k = (Yks (xi ))i≤n,s≤B(d,k) ∈ Rn×B(d,k) is the matrix of degree-k spherical harmonics evalu-
ated at the training data points. We claim that the following properties hold:
Low-frequency part: K ≤ℓ ℓ
n has rank at most B(d, 0) + . . . + B(d, ℓ) = Θd (d ) much lower than n
(recall that we assumed n ≥ dℓ+δ ). Furthermore, one can show that for any k ≤ ℓ,

n−1 Y T
k Y k − I B(d,k) op
= od,P (1) . (3.4.2)

Therefore, we have
n  o
σmin K ≤ℓ 2
σmin Y k Y T = Θd,P (nd−ℓ ) = Θd,P (dδ ) ,

n ≥ min ξd,ℓ k (3.4.3)
k≤ℓ

where we used n ≥ dℓ+δ . Hence K ≤ℓ n corresponds to a low rank matrix with diverging
eigenvalues along the low-frequency component (degree at most ℓ polynomials) of K n .

28
High-frequency part: K >ℓ >ℓ
n is approximately proportional to the identity with ∥K n − γI n ∥op =
od,P (1), where
X X
2
γ= ξd,k B(d, k) = ck + od (1) . (3.4.4)
k≥ℓ+1 k≥ℓ+1

We are now ready to sketch the proof of Theorem 2.(a). Let us decompose the KRR solution
into a low-frequency (the projection on degree-ℓ polynomials) and a high-frequency component:
 −1
fˆλ (x) ≈ K(x, ·) K ≤ℓ
n + (λ + γ)I n y = fˆλ,≤ℓ (x) + fˆλ,>ℓ (x) , (3.4.5)

where
 −1
fˆλ,≤ℓ (x) := P≤ℓ fˆλ (x) = K ≤ℓ (x, ·) K ≤ℓ
n + (λ + γ)I n y, (3.4.6)
 −1
fˆλ,>ℓ (x) := P>ℓ fˆλ (x) = K >ℓ (x, ·) K ≤ℓ
n + (λ + γ)I n y. (3.4.7)

The risk can be written as the sum of contributions along P≤ℓ and P>ℓ :
n 2 o n 2 o n 2 o
E f⋆ (x) − fˆλ (x) = E P≤ℓ f⋆ (x) − fˆλ,≤ℓ (x) + E P>ℓ f⋆ (x) − fˆλ,>ℓ (x) (3.4.8)
| {z } | {z }
(I) (II)

We can bound these two terms separately:


Term (I): This term is equivalent to the test error of doing kernel ridge regression with kernel
K≤ℓ , regularization parameter λ + γ, target function P≤ℓ f⋆ (x) and data yi = P≤ℓ f⋆ (xi ) + ε̃i
where ε̃i = P>ℓ f⋆ (xi ) + εi (note that the noise ε̃ is not independent of P≤ℓ f⋆ (xi ) anymore but
is still uncorrelated). The dimension of the target space (i.e. Vd,≤ℓ ) and the rank of the kernel
are now Θd (dℓ ) much smaller than the number of samples n. Hence, term (I) corresponds to
the test error of KRR in low-dimension. We can therefore expect (and show) that this error
is vanishing:
(I) = ∥P≤ℓ f⋆ ∥2L2 · od,P (1) + Var(ε̃i ) · od,P (1) = ∥f⋆ ∥2L2 + τ 2 · od,P (1) .

(3.4.9)

Term (II): Similarly to K >ℓ , we can show that


n o
nEx K >ℓ (x, ·)T K >ℓ (x, ·) − κ · I n = od,P (1) , (3.4.10)
op
4 B(d, k) = Θ (nd−ℓ−1 ) = o (1) where we used that n ≤ dℓ+1−δ . We
P
with κ = k≥ℓ+1 nξd,k d d
can therefore bound
n o  −1 2
∥fˆλ,>ℓ ∥2L2 ≤ Ex K >ℓ (x, ·)T K >ℓ (x, ·) K ≤ℓ
n + (λ + γ)I n y
op 2
(3.4.11)
1 ∥y∥22 2 2

≤ od,P (1) · · = o d,P (1) · ∥f ∥
⋆ L 2 + τ ,
(λ + γ)2 n
where we used by Markov’s inequality that
n−1 ∥y∥22 = Od,P (1) · E{n−1 ∥y∥22 } = Od,P (1) · ∥f⋆ ∥2L2 + τ 2 .

(3.4.12)
Therefore the contribution of the second term is
(II) = ∥P>ℓ f⋆ ∥2L2 + ∥f⋆ ∥2L2 + τ 2 · od,P (1) .

(3.4.13)

29
Combining the bounds (3.4.11) and (3.4.13) in Eq. (3.4.8) yields part (a) of Theorem 2.

Remark 3.4.1 (Self-induced regularization). From the proof of Theorem 2, we see that the high-
frequency component K >ℓ n of the kernel matrix concentrates on identity and plays the role of an
additive self-induced ridge regularization γ > 0. Hence the effective regularization parameter of
the KRR solution is λ + γ (see for example Eq. (3.4.6)), which is bounded away from zero even
when taking λ = 0+. This explains why kernel ridgeless regression generalizes well even when
interpolating noisy training data.

30
Chapter 4

Random features

The random feature (RF) model is a two-layer neural network in which the first layer weights are
not trained, and kept equal to their random initialization. It was initially introduced in [RR07] as
a randomized approximation for kernel methods. Here we regard it as a particularly simple limit
case of linearized neural networks, cf. Section 1.4.
We will use the RF model as a toy model to explore two important questions about neural
networks trained in the linear regime:

(i) How large should N be for the generalization error of RF to match the error associated with
its infinite-width kernel limit (N → ∞) obtained in Chapter 3?

(ii) What is the impact on the performance of RF when N is not chosen sufficiently large?

While the RF model presents a simpler structure than the (finite width) neural tangent model, we
expect them to display similar behavior, provided we match the number of parameters in the two
models. Chapter 5 partly confirms this picture.

4.1 The random feature model


The random feature model is given by

N
1 X
fRF (x; a) = √ ai σ(⟨wi , x⟩) , (4.1.1)
N i=1

where a = (a1 , . . . , aN ) ∈ RN are the free parameters that are trained, while the weights (wi )i≤N
are fixed random wi ∼iid ν. Equivalently, the RF model can be seen as training the second layer
weights of a two-layer neural network, while keeping the first-layer weights fixed.
Recall the definition of the featurization map ϕRF : Rd → RN given by

1
ϕRF (x) = √ [σ(⟨w1 , x⟩); . . . ; σ(⟨wN , x⟩)] , (4.1.2)
N

31
so that fRF (x; a) = ⟨a, ϕRF (x)⟩. We are interested in the ridge regression solution
( n )
X 2
â(λ) = argmina∈RN yi − fRF (xi , a) + λ∥a∥22 (4.1.3)
i=1
n o
2
= argmina∈RN y − Φa 2
+ λ∥a∥22 , (4.1.4)

T
where the design matrix Φ = ϕRF (x1 )T , . . . , ϕRF (xn )T ∈ Rn×N has entries Φij = √1 σ(⟨xi , w j ⟩).

N
Throughout this chapter, we will consider sampling wi ∼iid Unif(Sd−1 ).
Let us emphasize here the reason why the RF model is much simpler to study than the neural
tangent model. The entries of the featurization map (4.1.2) (and the columns of Φ) are iid with
respect to the randomness in wi . On the other hand, in the case of the neural tangent model,
the entries of the featurization map ϕNT (x) = (N d)−1/2 [σ ′ (⟨w1 , x⟩)xT ; . . . ; σ ′ (⟨wN , x⟩)xT ] are only
block independent with respect to the randomness in wi . Tracking the dependency structure in
the design matrix makes the analysis of the neural tangent model harder. See Chapter 5 for an
example of such an analysis.

4.2 Test error in the polynomial high-dimensional regime


In Chapter 3, we saw that as N → ∞, the ridge regression solution converges to the kernel
ridge regression solution with inner-product kernel KRF . What is the impact of finite N on the
generalization error of the RF model?
We consider the same setting as in Section 3.2. Assume we are given i.i.d. samples {(yi , xi )}i≤n
with
√ 
yi = f⋆ (xi ) + εi , xi ∼ Unif Sd−1 ( d) , (4.2.1)

where f⋆ ∈ L2 := L2 (Sd−1 ( d)) and the noise εi is independent of xi , with E{εi } = 0, E{ε2i } = τ 2 .
We consider the RF model as in Eq. (4.1.1). We will assume the following ‘genericity’ condition
on the activation function σ: (i) |σ(x)| ≤ c0 exp(c1 |x|) for some constants c0 , c1 > 0, and (ii) for
any k ≥ 0, σ has non-zero Hermite coefficient µk (σ) := EG {σ(G)Hek (G)} = ̸ 0 (where G ∼ N(0, 1)).
We are interested in the random feature ridge regression (RFRR) solution
−1
fˆRF (x; â(λ)) = ϕRF (x)T ΦT λI n + ΦΦT y, (4.2.2)

and consider the excess test error with square loss which we denote by
n 2 o
RRF (f⋆ ; X, y, W , λ) := Ex f⋆ (x) − fRF (x; â(λ)) . (4.2.3)

Theorem 3. Assume dℓ1 +δ ≤ n ≤ dℓ1 +1−δ , dℓ2 +δ ≤ N ≤ dℓ2 +1−δ , max(N/n, n/N ) ≥ dδ for
some integers ℓ1 , ℓ2 and constant δ > 0. Denote ℓ = min(ℓ1 , ℓ2 ). Further assume that σ verifies
the genericity conditions. Then there exists a constant λ⋆ = Θd ((N/n) ∧ 1) such that for any
regularization parameter λ ∈ [0, λ⋆ ] and all η > 0, we have (cf. Eq. (4.2.3))

RRF (f⋆ ; X, y, W , λ) = ∥P>ℓ f⋆ ∥2L2 + od,P (1) · ∥f⋆ ∥2L2+η + τ 2 .



(4.2.4)

32
In this theorem, the number of neurons N and the number of samples n play symmetric roles.
The test error of RFRR is determined by the minimum of N and n, as long as N , n and dℓ for
integers ℓ are well separated.
We can distinguish two regimes:

Underparametrized regime: For N ≪ n (less parameters than number of samples), the test
error is limited by the number of random features and the approximation error dominates,
i.e.,
RRF (f⋆ ; X, y, W , λ) = Rapp (f⋆ ; W ) + od,P (1) ,
n 2 o (4.2.5)
Rapp (f⋆ ; W ) := inf Ex f⋆ (x) − fˆRF (x; a) ,
a∈RN

which corresponds to the best fit of f⋆ by a linear combination of N random features. If


dℓ ≪ N ≪ dℓ+1 , we have fˆRF ≈ P≤ℓ f⋆ and the model fits the best degree-ℓ polynomial
approximation to f⋆ and nothing else.
To get an intuitive picture of this phenomenon, ˆ
2 d−1
√notice that the model fRF is contained in the
subspace span{σ(⟨wi , · ⟩) : i ≤ N } ⊂ L (S ( d)) of dimension N . By parameter-counting,
in order to approximate the space of degree-ℓ polynomials of dimension Θd (dℓ ), we need the
number of parameters to be Ωd (dℓ ), which matches the requirement in our theorem.

Overparametrized regime: For n ≪ N (more parameters than number of samples), the test
error is limited by the number of samples and the statistical error dominates, i.e.,

RRF (f⋆ ; X, y, W , λ) = RKRR (f⋆ ; X, y, λ) + od,P (1) , (4.2.6)

where RKRR is the risk of KRR with kernel KRF (corresponding to the infinite-width limit
N → ∞) as obtained in Theorem 2. If dℓ ≪ n ≪ dℓ+1 , we have fˆRF ≈ P≤ℓ f⋆ . Again, this
matches the parameter-counting heuristic: the information-theoretic lower bound to learn the
space of degree-ℓ polynomials is Ωd (dℓ ) samples.

To summarize, denoting RRF (n, N ) the test error achieved by the RF model with n samples and
N neurons, as long as n, N and dℓ for integers ℓ are well separated, we have

RRF (n, N ) = max{RRF (n, ∞), RRF (∞, N )} + od,P (1) , (4.2.7)

where RRF (∞, N ) corresponds to the approximation error (4.2.5) and RRF (n, ∞) to the statistical
error (4.2.6).
Before sketching the proof of Theorem 3, let us make three further comments:

Optimal overparametrization. In the RF model, we are free to choose the number of neurons
N and it is interesting to ask the following question: given a sample size n, how large should
we choose N ? Theorem 3 shows that the test error decreases until N ≈ n, and then stays
roughly the same as long as N ≥ n1+δ (for some δ > 0, although this specific condition is
mainly due to our proof technique) and achieves the same test error as the limiting kernel
method N = ∞ (Theorem 2). Further overparametrization does not improve the risk.

Benign overfitting. The flipside of the last remark is that we can keep increasing N above n1+δ
(hence overparametrize the model by an arbitrary amount), without deteriorating the gener-
alization error.

33
Figure 4.1: Double descent in the test error of random feature ridge regression with ReLu activation
(σ = max{0, x}). The data are generated using yi = ⟨β, xi ⟩ with ∥β∥2 = 1 and ψ2 = n/d = 3. The
regularization parameter is chosen to be λ = 10−8 (left frame) and λ = 10−3 (right frame). The continuous
black lines correspond to theoretical predictions. The colored symbols are numerical simulations averaged
over 20 instances and for several dimensions d. (Figure from [MM19].)

Optimality of interpolation. In the overparametrized regime, the minimum-norm interpolating


solution λ = 0+ achieves the best test error. Hence, similarly to KRR as discussed in Chapter
3, the random feature model displays the benign overfitting property. The mechanism for
RF is very similar to the KRR case (Remark 3.4.1): the high-degree non-linear part of the
activation function σ acts as noise in the features and behaves as an effective self-induced
regularization (see next section).

4.3 Double descent and proportional asymptotics


In Theorem 3, we assumed that the values of n and N are sufficiently well separated. Indeed, when
N is roughly equal to n, the design matrix Φ becomes nearly square and its condition number
becomes large. This leads to a peak in the test error at the interpolation threshold1 N = n. This
has been called the ‘double descent’ phenomenon, and we already encountered it in Section 2.3.
In the present context, the interpretation of this phenomenon is even clearer. As the number
of parameters increases, the test error first follows the classical U-shaped curve, with error initially
decreasing and then increasing as it approaches the interpolation threshold. This is in line with
the classical picture of bias-variance tradeoff, where increasing the model complexity reduces model
misspecification (i.e. the model is too simple to fit the data) but increases sensitivity to statisti-
cal fluctuations (i.e. the model may overfit to noise in the training data). However, unlike the
traditional U-shaped curve, the test error descends again beyond the interpolation threshold, and
the optimal test error is achieved when the number of parameters is significantly larger than the
number of samples.
1
The name ‘interpolation threshold’ comes from the fact that as long as N ≥ n, the RF model has enough
parameters to interpolate the training data. If λ = 0+, then the training error becomes 0 at N = n.

34
The double descent phenomenon in the random feature model has been studied in the propor-
tional asymptotic regime (when N, n, d → ∞ with N/d → ψ1 and n/d → ψ2 ) in [MM19]. Figure
4.1 reports the test error of random feature ridge regression obtained numerically for a fixed value
of ψ2 = 3 and two different regularization parameters λ = 10−8 (left) and λ = 10−3 (right).
Continuous lines correspond to the theoretical predictions from [MM19], that take the form
sketched below.
Theorem 4. Let the activation function σ satisfy the genericity and growth assumptions stated
above. Define
2 2 2
b0 := E{σ(G)}, b1 := E{Gσ(G)}, b⋆ := E{σ(G)2 } − b0 − b1 , (4.3.1)
2 2 2
where expectation is with respect to G ∼ N(0, 1). Assuming 0 < b0 , b1 , b⋆ < ∞, define ζ by

b1
ζ := . (4.3.2)
b⋆
and consider proportional asymptotics N/d → ψ1 , n/d → ψ2 . Finally, let the regression func-
tion {fd }d≥1 be such that fd (x) = βd,0 + ⟨β d,1 , x⟩ + fdNL (x), where fdNL (x) is a centered rotation-

ally invariant Gaussian process indexed by x ∈ Sd−1 ( d), satisfying Ex {fdNL (x)x} = 0. Assume
E{fdNL (x)2 } → F⋆2 , ∥β d,1 ∥22 → F12 .
Then for any value of the regularization parameter λ > 0, the asymptotic prediction error of
random feature ridge regression satisfies
h i
2 2
EX,W ,ε,f NL RRF (fd ; X, W , λ) − F12 B(ζ, ψ1 , ψ2 , λ/b⋆ ) + (τ 2 + F⋆2 )V (ζ, ψ1 , ψ2 , λ/b⋆ ) + F⋆2 = od (1) ,
d

where EX,W ,ε,f NL denotes expectation with respect to data covariates X, first layer weigth vectors
d
W , data noise ε, and fdNL the nonlinear part of the true regression function (as a Gaussian process).
The functions B, V are explicitly given.
Overall, the behavior of random feature ridge regression (RFRR) in high dimension can be
summarized as follows: as n increases while N is kept fixed, the test error of RFRR initially
concentrates on the statistical error (test error of kernel ridge regression, with N = ∞) for n ≪ N ,
then peaks at the interpolation threshold n = N (the double descent phenomenon), and finally
saturates on the approximation error (test error with n = ∞) for n ≫ N .

4.4 Polynomial asymptotics: Proof sketch


In this section we outline the proof of Theorem 3.
Recall that the random feature ridge regression solution is given by
−1
fˆRF (x; â(λ)) = ϕRF (x)T ΦT λI n + ΦΦT y. (4.4.1)

To prove Theorem 3, we need to control the behavior of the feature matrix Φ ∈ Rn×N :
 
σ(⟨x1 , w1 ⟩) σ(⟨x1 , w2 ⟩) · · · σ(⟨x1 , wN ⟩)
 σ(⟨x2 , w1 ⟩) σ(⟨x2 , w2 ⟩) · · · σ(⟨x2 , wN ⟩) 
Φ= . (4.4.2)
 
.. .. ..
 . . . 
σ(⟨xn , w1 ⟩) σ(⟨xn , w2 ⟩) · · · σ(⟨xn , wN ⟩)

35
Similarly to the kernel matrix in Section 3.4, we show that in the polynomial regime Φ can be
approximately decomposed into a low-rank matrix (coming from the low-degree polynomials up
to degree ℓ = min(ℓ1 , ℓ2 )) plus a matrix that is proportional to an orthogonal matrix, whose span
is approximately orthogonal to the low rank part. This decomposition again rely on an operator
diagonalization. √ √
Consider the activation function σ, seen as an integral operator from
√ L2 (Sd−1 ( d)) to L2 (Sd−1 ( d))
(for convenience, we rescale the argument to have w ∼ Unif(Sd−1 ( d))):
Z √
g(w) → σ(⟨x, w⟩/ d)g(w)τd (dw) ,

√ √
where τd √denotes the uniform measure on Sd−1 ( d). From assumption (i), we have σ(⟨·, ·⟩/ d) ∈
L2 (Sd−1 ( d)⊗2 ) and from the spectral theorem for compact operators, the function σ can be
diagonalized as

√ ∞
X
σ(⟨x, w⟩/ d) = λj ψj (x)ϕj (w) , (4.4.3)
j=1


where (ψj )j≥1 and (ϕj )j≥1 are two orthonormal bases of L2 (Sd−1 ( d)), and the λj are the eigen-
values with nonincreasing absolute values |λ1 | ≥ |λ2 | ≥ · · · ≥ 0. By rotational invariance of the
operator (see Section 3.3), σ is diagonalizable with respect to the spherical harmonics orthogonal
basis
∞ B(d,k)
√ X X
σ(⟨x, w⟩/ d) = ξd,k Yks (x)Yks (w) . (4.4.4)
k=0 s=1

Assumption (ii) on the activation σ implies that |ξd,k | ≍ d−k/2 .


Using the notations in Eq. (4.4.3), let ψ k = (ψk (x1 ), . . . , ψk (xn ))T be the evaluation of the k-th
left eigenfunction at the n data points, and let ϕk = (ϕk (w1 ), . . . , ϕk (wN ))T be theP evaluation of
the k-th right eigenfunction at the N random first-layer weights. Further, let k(ℓ) := k≤ℓ B(d, k).
Similarly to Section 3.4, we expand Φ into a low-frequency and a high-frequency component:

X
Φ= λ j ψ j ϕT
j =: Φ≤ℓ + Φ>ℓ , (4.4.5)
j=1
k(ℓ) ∞
X X
Φ≤ℓ = λj ψ j ϕT
j =: ψ ≤ℓ Λ≤ℓ ϕT
≤ℓ , Φ>ℓ = λj ψ j ϕT
j , (4.4.6)
j=1 j=k(ℓ)+1

where Λ≤ℓ = diag(λ1 , . . . , λk(ℓ) ), ψ ≤ℓ ∈ Rn×k(ℓ) is the matrix whose j-th column is ψ j , and
ϕ≤ℓ ∈ RN ×k(ℓ) is the matrix whose j-th column is ϕj . Recalling |ξd,k | ≍ d−k/2 , for d sufficiently
large, Λ≤ℓ contains exactly all the singular values associated to the spherical harmonics of degree
at most ℓ.
Notice that the matrix Φ has the same distribution under the mapping Φ ↔ ΦT , n ↔ N ,
xi ↔ wj . Hence, without loss of generality let us consider the overparametrized case N ≥ n1+δ
with dℓ+δ ≤ n ≤ dℓ+1−δ . Then we have the following properties:

36

Low-frequency part: Φ≤ℓ / N has rank k(ℓ) = Θd (dℓ ) much smaller than min(n, N ) = n ≥ dℓ+δ .
Furthermore,

ψT
≤ℓ ψ ≤ℓ /n − I k(ℓ) op
= od,P (1) , ϕT
≤ℓ ϕ≤ℓ /N − I k(ℓ) op
= od,P (1) , (4.4.7)
√  √
and σmin Φ≤ℓ / N = Ωd,P (mink≤ℓ n|ξd,k |) = Ωd,P (dδ ). Hence, Φ≤ℓ / N corresponds to a
low-rank spiked matrix along the low-frequency components (degree-ℓ polynomials).

High-frequency part: The singular values of Φ>ℓ / N concentrates, i.e., Φ>ℓ ΦT
>ℓ /N −γI n op =
od,P (1) where
X
2
γ= ξd,k B(d, k) . (4.4.8)
k≥ℓ+1

Furthermore, ∥Φ>ℓ ϕT≤ℓ /N ∥op = od,P (γ


1/2 ). Hence the high-frequency component Φ
>ℓ behaves
similarly to a random noise matrix, independent of ϕ≤ℓ .

Based on the above description of the structure of the feature matrix, ridge regression with
respect to the random features σ(⟨wj , ·⟩) is essentially equivalent to doing a regression against a
polynomial kernel of degree ℓ, with ℓ depending only on the smaller of the sample size and the
network size. The same mechanism as in kernel regression appears: the high-frequency part of the
activation σ effectively behaves as noise in the features and can be replaced by an effective ridge
regularization γ (another example of self-induced regularization).

Remark 4.4.1. Working with the uniform measure over the sphere is particularly convenient
because one can exploit known properties of spherical harmonics. However, the analysis in the
polynomial regime can be extended to other distributions of the covariate vectors xi and of the
weights wj under appropriate abstract conditions [MMM22].

37
38
Chapter 5

Neural tangent features

In this section we consider the neural tangent model associated to a finite-width fully connected
two-layer neural network. This model was already introduced in Section 1.4 and captures the
behavior of two-layer networks initialized at random, and trained in the linear/lazy regime. As
throughout these notes, we focus on square loss.
As discussed in Chapter 3, the infinite-width limit of linearized neural networks corresponds
to kernel ridge regression (KRR) with respect to a certain inner product kernel. Here we want to
address the following key question:
How large the width of a two-layer network should be for the test error to be well
approximated by its infinite-width limit?
It turns out that the answer to this question is a natural generalization of the one we obtained
for the random feature model in the last chapter. Roughly speaking, if the number of parameters
is large compared to the sample size (hence the model is overparametrized), then the test error is
well approximated by the infinite width test error. Notice that the total number of parameters in
this case is N d, and therefore this condition translates into
Nd ≫ n , (5.0.1)
whereby for the random feature model, the overparametrization condition amounted to N ≫ d.

5.1 Finite-width neural tangent model


Recall that the neural tangent model is defined by
N
1 X
fNT (x; b) = √ ⟨bi , x⟩σ(⟨wi , x⟩) . (5.1.1)
N d i=1

Here b = (bi , . . . , bN ) ∈ RN d is a vector of parameters that are learnt from data, while (wi )i≤N
are i.i.d. first-layer weights at initialization, with common distribution ν. In our analysis, we will
take ν = Unif(Sd−1 ) to be the uniform measure over the sphere. This model can be equivalently
described in terms of the NT featurization map
1
ϕNT (x) = √ [σ ′ (⟨w1 , x⟩)xT ; . . . ; σ ′ (⟨wN , x⟩)xT ] . (5.1.2)
Nd

39
We then have fNT (x; b) = ⟨b, ϕNT (x)⟩.
We are given i.i.d. samples {(yi , xi )}i≤n and are interested in the ridge regression solution
n o
2
b̂(λ) = argminb∈RN d y − Φb 2 + λ∥b∥22 , (5.1.3)

where we introduced the design matrix Φ ∈ Rn×N d by


 
—ϕNT (x1 )—
 —ϕ (x2 )— 
NT
Φ :=  . (5.1.4)
 
..
 . 
—ϕNT (xn )—
We assume the same model as in the previous chapter, namely
√ 
yi = f⋆ (xi ) + εi , xi ∼ Unif Sd−1 ( d) , (5.1.5)
where εi is noise independent of xi , with E{εi } = 0, E{ε2i } = τ 2 . The corresponding test error is
n 2 o
RNT (f⋆ ; X, y, W , λ) := Ex f⋆ (x) − fˆNT (x; b̂(λ)) . (5.1.6)

Remark 5.1.1 (RF+NT). Recall (cf. Section 1.4) that the full model obtained by linearizing
two-layer neural networks with respect to both first layer and second layer weights takes the form
flin (x; a, b) = f0 (x) + fRF (x; a) + fNT (x; b) , (5.1.7)
where f0 (x) is the neural network at initialization.
We introduced two simplifications here. First, we dropped the initialization term f0 (x). As
already mentioned in Section 1.4, we can indeed endorse f0 (x) = 0 at the price of doubling the
number of neurons and choosing the initialization (wj )j≤2N so that wN +j = wj and aN +j = −aj .
The analysis of this initialization is the same as the one pursued here.
Second, we drop the term fRF (x; a). This can be shown to have a negligible effect in high
dimension (for a generic activation function). Informally, it amounts to reducing the number of
parameters from N d + N to N d, which is a negligible change.
Remark 5.1.2 (Complexity at prediction time). So far, our presentation mirrored the one of the
RF model in the last chapter. However, there is an important practical difference in the complexity
of evaluating the functions fRF ( · ; a) and fNT ( · ; b) which is the task we need to perform at prediction
(a.k.a. inference) time.
Consider first the RF model. Organizing the weights wi as rows of a matrix W ∈ RN ×d ,
the function fRF ( · ; a) can be evaluated at a point x using O(N d) operations. P First, we compute
z = W x (which can be done with O(N d) operations), and then we output N i=1 ai σ(zi ) which can
be done in O(N ) operations (if evaluating the one-dimensional function σ takes O(1) operations).
N ×d . Next we evaluate
Next consider the NT model. We arrange PN the ′ parameters bi into B ∈ R
u = Bx, z = W x and finally output i=1 ui σ (zi ). Under similar assumptions, this results in
O(N d) operations.
At first sight, the two models behave similarly. However, as we saw in the last section, the
‘approximation power’ of the RF model is controlled by the number of parameters p = N . Similarly,
in the NT model we will see that it is controlled by p = N d. Hence the complexity of prediction
at constant number of parameters is O(pd) for the RF model compared to O(p) for the NT model.
In other words, the NT model is much simpler to evaluate in high dimension.

40
Remark 5.1.3 (Complexity of SGD training). Consider next training either the RF or the NT
model using stochastic gradient descent (SGD) under quadratic loss. Of course, in this case the cost
function is quadratic and we can use faster algorithms than SGD. However, these algorithms are
not available, or of common use for actual non-linear networks, and we hope to gain some insights
into those cases. We assume a batch of size B, which we denote by {(y1 , x1 ), . . . (yB , xB )}, and will
just compute the complexity of a single SGD step.
For the RF model the gradient takes the form (here R̂B (a) is the empirical risk with respect to
the batch)
∇a R̂B (a) = −σ(W X T )r , (5.1.8)

where W ∈ RN ×d is the matrix whose i-th row is wi , X ∈ RB×d is the matrix whose j-th row is
xi , and r ∈ RB , with ri = yi − fRF (xi ; a). Assuming σ can be evaluated in O(1) time, the above
gradient can be evaluated in time O(N dB) = O(pdB).
For the NT model, we view the parameters as a matrix b ∈ RN ×d whose i-th row corresponds
to neuron i. We then have
—σ(wT T
 
1 xi ) · xi —
B  —σ(wT xi ) · xT — 
X 2 i
∇b R̂B (b) = − ri  . (5.1.9)
 
..
i=1
 . 
—σ(wT T
N xi ) · xi —

Again, this can be evaluated in time O(N dB) = O(pB). The complexity is significantly smaller
than for RF.

5.2 Approximation by infinite-width KRR: Kernel matrix


The empirical kernel matrix plays a crucial role in ridge regression, and hence it is useful to begin
by analyzing its structure. We denote the kernel matrix associated to the finite-width NT model
by K N ∈ RN d×N d . By definition, its i, j-th entry is (K N )ij = ⟨ϕNT (xi ), ϕNT (xj )⟩. Explicitly:
N
1 X ′
(K N )ij = σ (⟨wk , xi ⟩)σ ′ (⟨wk , xj ⟩)⟨xi , xj ⟩ . (5.2.1)
Nd
k=1

The corresponding infinite-width kernel matrix is given by the expectation of the above
⟨xi , xj ⟩
· Ew σ ′ (⟨w, xi ⟩)σ ′ (⟨w, xj ⟩) .

(K)ij = (5.2.2)
d
Here and below, we denote by Ew , Pw expectation and probability with respect to w at fixed
(xi : i ≤ n).
By the law of large numbers (and under very minimal conditions on σ, e.g. Ew [σ ′ (⟨w, xi ⟩)2 ] <
∞), we have limN →∞ (K N )ij = (K)ij for fixed i, j. However, we are interested in N, n, d all diverg-
ing simultaneously, and the convergence of the matrix entries does not provide strong information
on the overall matrix (e.g. its eigenvalues), which is our main focus.
The activation function will enter our estimates through the following parameters
X 1
vℓ (σ) := ⟨Hek , σ ′ ⟩2L2 (N(0,1)) . (5.2.3)
k!
k≥ℓ

41
Here ⟨f, g⟩L2 (N(0,1)) = E{f (G)g(G)} (where G ∼ N(0, 1)) is the usual L2 inner product with respect
to the Gaussian measure, and Hek is the k-th Hermite polynomial, with the standard normalization
⟨Hej , Hek ⟩L2 (N(0,1)) = k!δjk . We further make the following assumption on σ.
Assumption 1 (Polynomial growth). We assume that σ is weakly differentiable with weak deriva-
tive σ ′ satisfying |σ ′ (x)| ≤ B(1 + |x|)B for some finite constant B > 0, and that vℓ (σ ′ ) > 0. (The
latter is equivalent to σ not being a maximum degree-ℓ polynomial.)
We next state a kernel concentration result.
Theorem 5 (Kernel concentration). Let γ = (1 − ε0 )vℓ (σ) for some constant ε0 ∈ (0, 1). Further,
let β > 0 be arbitrary. Then, there exist constants C ′ , C0 > 0 such that the following hold.
Define the event:  √ √
Aγ := K ⪰ γI n , ∥X∥op ≤ 2( n + d) . (5.2.4)
For any X ∈ Aγ , we have
r ′
!
−1/2 −1/2 (n + d)(log(nN d))C ′ (n + d)(log(nN d))C
Pw K KN K − In op
> + ≤ d−β .
Nd Nd
(5.2.5)
Further if n ≤ dℓ+1 /(log d)C0 , then P(Aγ ) ≥ 1 − n−β for all n large enough.
Remark 5.2.1. Note that the event Aγ only depends on the matrix of covariates X. Hence, we
view Aγ as a set in Rn×d and write X ∈ Aγ if X satisfies the stated conditions. Equation (5.2.5)
holds for any such X, not necessarily random ones (the probability being only with respect to W ).
Since P(Aγ ) ≥ 1 − n−β under the stated conditions, we also have that, if n ≤ dℓ+1 /(log d)C0 ,
then with probability at least 1 − d−β (both with respect to X and W )

r
−1/2 −1/2 (n + d)(log(nN d))C ′ (n + d)(log(nN d))C
K KN K − I n op ≤ + . (5.2.6)
Nd Nd
Remark 5.2.2. A first attempt at proving concentration would be to try to control ∥K N − K∥op .
However, this results in suboptimal bounds because K has eigenvalues on several well separated
scales. Indeed, we saw in Chapter 3 that K has one eigenvalue of order n, d eigenvalues of order
n/d, approximately dk /k! eigenvalues of order n/dk for any k ≤ ℓ, and n − O(dℓ ) eigenvalues of
order 1.
In Theorem 5 we bound ∥K −1/2 K N K −1/2 − I n ∥op , hence effectively looking at each group of
eigenvalues at different scales.
Assume for simplicity n ≥ c1 d for some constant c1 . There is little loss of generality in doing
so. Indeed, even if we know that the target function is linear f∗ (x) = ⟨β ∗ , x⟩, it is information
theoretically impossible to achieve prediction error less that (say) half the error of the trivial
predictor fˆ(x) = 0, unless n ≥ c1 d for some constant c1 .
In this setting the norm bound in Theorem 5 (equivalently, the right-hand side of Eq. (5.2.6))
is small as soon as
Nd
≫ n. (5.2.7)
(log N d)C
Modulo the log factors, this is the sharp overparametrization condition N d ≫ n. This is also our
first piece of evidence that the relevant control parameter is the ratio of number of parameters to
sample size, rather than number of neurons to sample size.

42
5.3 Approximation by infinite-width KRR: Test error
By itself, controlling the kernel matrix as we did in Theorem 5 does not allow to bound the test
error, which involves evaluating the regression function at a fresh test point. Explicitly, we have
−1
fˆ(x; λ) = KN (x, X) K N + λI N y (5.3.1)
T
−1 −1
= KN (x, · ) K N + λI N f ∗ + KN (x, X) K N + λI N ε, (5.3.2)
where we used the notations KN (x, · ) := (KN (x, xi ) : i ≤ n) ∈ Rn and f ∗ := (f∗ (xi ) : i ≤ n) ∈
Rn . Consider, to be concrete, the bias contribution to the test error. We get
−1 2
B(λ) = Ex f∗ (x) − KN (x, · )T K N + λI N

f∗
T −1
= Ex f∗ (x)2 − 2Ex f∗ (x)KN (x, · )
 
K N + λI N f∗
−1 (2) −1
+ f ∗ K N + λI N K N K N + λI N f∗ ,
(2) (2)
where K N ∈ Rn×n is the matrix with entries (K N )ij = Ex {KN (xi , x)KN (x, xj )}.
Clearly the bound on the kernel matrix afforded by Theorem 5 is not sufficient to control this
expression. Indeed, proving the result below is significantly more challenging.
Theorem 6. Let B, c0 > 0, ℓ ∈ N be fixed. Then, there exist constants C0 , C, C ′ > 0 such that the
following holds.
If σ satisfies Assumption 1, and further
d2
c0 d ≤ n ≤ , if ℓ = 1, (5.3.3)
(log d)C0
dℓ+1
dℓ (log d)C0 ≤ n ≤ , if ℓ > 1. (5.3.4)
(log d)C0
If N d/(log(N d))C ≥ n, then for any λ ≥ 0,
r
 n(log(N d))C ′ 
RNT (f∗ ; λ) = RKRR (f∗ ; λ) + Od,P τ+2
Nd
where τ+2 := ∥f∗ ∥2L2 + τ 2 .
As anticipated, this theorem establishes that the infinite width kernel is a good approxima-
tion to the finite width kernel under the overparametrization condition (5.2.7) which is only a
polylogarithmic factor above the optimal condition N d ≫ n.
Roughly speaking, the condition of Eqs. (5.3.3) and (5.3.4) rules out cases in which n is compa-
rable to dℓ for an integer ℓ. As we saw in Chapter 3, when n ≍ dℓ , the structure of the infinite width
kernel matrix K is more complex, and the KRR with respect to inner product kernels ‘partially’
fits the degree ℓ component of the target function. We expect this condition to be a proof artifact.
Figures 5.1 and 5.2 (from [MZ22]) illustrate the content of Theorem 6 via numerical simulations.
In these simulations we fix d = 20, τ = 0.5 and consider the target function
r r r
4 2 1
f∗ (x) = He1 (⟨β ∗ , x⟩) + He2 (⟨β ∗ , x⟩) + He4 (⟨β ∗ , x⟩) . (5.3.5)
10 10 120
We can clearly observe the main phenomena captured by Theorem 6. The train error vanishes
for N d ≥ n, while the test error converges to a limit for large N d/n. This limit is accurately
reproduced by (infinite-width) KRR.

43
Train error Test error 2.00
0.8 1.75
1.50
log(n)/log(d)

log(n)/log(d)
0.6 1.25
2 2
0.4 1.00
0.75
0.2 0.50
0.25
1 0.0 1
1 2 3 1 2 3
log(Nd)/log(d) log(Nd)/log(d)
Figure 5.1: Neural Tangent min-norm regression (λ = 0+) in d = 20 dimensions (see text for details). Left:
train error as a function of the sample size n and number of parameters N d. Right: test error. Results are
averaged over 10 realizations. (From [MZ22].)

Test error
1.4 n is 46
1.2 n is 444
n is 4304
1.0
0.8
0.6
0.4
0.2
0.0
4 6 8 10 12
log(Nd)

Figure 5.2: Test error Neural Tangent min-norm regression (λ = 0+) in d = 20 dimensions (see text for
details). Color curves refer to n = 46, 444 and 4304, and dashed lines report the corresponding KRR limit.
(From [MZ22].)

44
Remark 5.3.1. As we saw in Chapter 3, under the conditions of Eqs. (5.3.3), (5.3.4), KRR with
respect to an inner product kernel approximates the projection of the target function onto degree
ℓ polynomials. Putting this result together with Theorem 6 (under the conditions in the theorem’s
statement) yields
r r
 n(log(N d))C′ n(log n)C 
RNT (f∗ ; λ) = RPRR (f∗ ; λ + vℓ (σ)) + Od,P τ+2 + τ+2 . (5.3.6)
Nd dℓ+1
Here RPRR (f∗ ; λ′ ) is the risk of polynomial ridge regression (PRR), i.e. ridge regression with respect
to polynomials of maximum degree ℓ, at regularization λ′ . Note that Eq. (5.3.6) is somewhat more
quantitative than the statement Chapter 3. Namely, the difference between KRR and PRR is upper
bounded by (n/dℓ+1 )1/2 , up to a polylogarithmic factor.
We refer to [MZ22] for a more complete discussion, but point out that Eq. (5.3.6) displays one
more example of self-induced regularization. Namely, the ridge regularization parameter in PRR is
equal λ + vℓ (σ). The original parameter λ gets inflated by an additive term vℓ (σ) that comes from
the high frequency part of the activations.

5.4 Proof sketch: Kernel concentration


The proof of Theorem 5 is simple and instructive, and hence we present a brief sketch. We will
work under the simplifying assumption that supt |σ ′ (t)| ≤ M < ∞, and n ≥ d.
The proof uses the following classical matrix concentration inequality [Ver18][Thm. 5.4.1]. (Be-
low, for a random variable U , esssup(U ) := inf{t ∈ R : P(U ≤ t) = 1}.)
Theorem 7 (Matrix Bernstein inequality). Let {Z k }k≤N be a sequence of symmetric and inde-
pendent random matrices Z k ∈ Rn×n . Define
N
X
v := E[Z 2k ] , L := max esssup∥Z k ∥op . (5.4.1)
op k≤N
k=1

Then, for any t ≥ 0.


N
N 2 t2 /2
 1 X   
P Zk ≥ t ≤ 2n · exp − , (5.4.2)
N op v + N Lt/3
k=1

In other words, matrix Bernstein states that, with probability at least 1 − n−β
N
1 X C(β) p 
Zk ≤ max v log n, L log n . (5.4.3)
N op N
k=1

We define
D k := diag σ ′ (⟨wk , x1 ⟩), . . . , σ ′ (⟨wk , xn ⟩) .

(5.4.4)
Letting S N := K −1/2 K N K −1/2 − I denote the matrix that we want to bound, we have
N
1 X
SN = Zk , (5.4.5)
N
k=1
1 −1/2
Z k := K D k XX T D k K −1/2 − I . (5.4.6)
d

45
We begin by bounding the coefficient L in Bernstein inequality. By triangle inequality
1
∥Z k ∥op ≤ λmin (K)−1 M 2 ∥XX T ∥op + 1 . (5.4.7)
d
Therefore, on the event Aγ

M2 √ √
L≤ 4( n + d)2 + 1
γd
n
≤C ,
d
where the constant C depends on M , γ.
Next consider the coefficient v. We have of course E[Z 2k ] ⪰ 0. On the other hand,

1  −1/2
E[Z 2k ] = E K D k XX T D k K −1 D k XX T D k K −1/2 − I
d2
M2 
⪯ 2 E K −1/2 D k XX T XX T D k K −1/2
γd
M2 √ √
⪯ 2 4( n + d)2 E K −1/2 D k XX T D k K −1/2

γd
M2 √ √
= 4( n + d)2 I .
γd

Summing over k, we get v ≤ CN n/d.


Summarizing, there exists a constant C (depending only on M and γ) such that

Cn CN n
L≤ , v≤ .
d d
Applying Bernstein inequality (5.4.3), we get
r !
−1/2 −1/2 C Nn n
K KN K −I ≤ max log n, log n
op N d d
r 
n n
≤C log n + log n .
Nd Nd

This proves Theorem 5 in the case of bounded σ ′ . The general case requires to handle large values
of σ ′ (⟨wk , xi ⟩) via a truncation argument. This results in the extra logarithmic factors, which are
likely to be a proof artifact.

46
Chapter 6

Why stop being lazy (and how)

These lecture notes were focused so far on learning within a training regime in which the model
f (x; θ) is well approximated by its first order Taylor expansion (in the model parameters) around
its initialization. This style of analysis is useful in elucidating some puzzling properties of modern
machine learning:
1. Overparametrization helps optimization.

2. Gradient-based algorithms select a specific model among all the ones that minimize the em-
pirical risk.

3. The specific models can overfit the training data (vanishing training error) and yet generalize
to unseen data.
The linear regime provides a simple and analytically tractable setting in which these phenomena
can be understood in detail.
Finally, there is experimental evidence that, in some settings, the linear theory captures the
behavior of real SGD-trained neural networks [LXS+ 19, GSJW20] (see also Appendix B).
On the other hand, it is important to emphasize that linearized neural networks are significantly
less powerful than fully trained neural networks. In other words, while neural networks can be
trained in a regime in which they are well approximated by their first order Taylor expansion, this
is the byproduct of specific choices on the parametrization. Other choices are possible, leading to
different and potentially better models.
Following [COB19] we will use the somewhat irreverent term ‘lazy training’ to refer to training
schemes that produce networks in the linear regime. In this chapter we will discuss limitations of
this approach, and how other training schemes overcome them. In particular, we will clarify that
lazy training is not a necessary consequence of the infinite width limit. Other non-lazy infinite-width
limits exist, and can outperform lazy ones.

6.1 Lazy training fails on ridge functions


One of the simplest problems on which lazy training is suboptimal is provided by ‘ridge functions’,
i.e. functions that depend on a one-dimensional projection of the covariate vector. √
As in the previous chapters, assume to be given data {(xi , yi )}i≤n , where xi ∼ Unif(Sd−1 ( d)),
yi = f∗ (xi ) + εi and εi is noise with E{εi } = 0, E{ε2i } = τ 2 . A ridge function is a target function

47
of the form:

f∗ (x) = φ(⟨w∗ , x⟩) , (6.1.1)

where1 φ : R → R and ∥w∗ ∥2 = 1.


We saw in the previous chapters that the learning under isotropic covariates in the linear regime
is controlled by the decomposition of f∗ into polynomials. Let us compute the mass of f∗ in the
subspace Vd,k of polynomials of degree k that are orthogonal to polynomials of maximum degree
k − 1. Using the theory of Section 3.3, we obtain
(d) 2
lim ∥Pk f∗ ∥2L2 = lim B(d, k) · Ex φ(⟨w∗ , x⟩)Qk (⟨w∗ , x⟩)

d→∞ d→∞
1  2
= E φ(G)Hek (G) .
k!
In particular2

X 1  2
lim ∥P>ℓ f∗ ∥2L2 = E φ(G)Hek (G) =: bℓ .
d→∞ k!
k=ℓ+1

Unless φ is a polynomial, we have bℓ > 0 strictly for all ℓ. This simple remark has an immediate
consequence. We saw in previous chapters that if the sample size is n ≪ dℓ+1 , then linearized
neural network only learn the best approximation of the target by degree-ℓ polynomials. Hence,
their excess risk is bounded away from zero. This consequence is stated more precisely below.

Corollary 6.1.1. Fix δ > 0. Let RX (f∗ ), X ∈ {KRR, RF, NT} be (respectively) the excess test error
of kernel ridge regression (with inner product kernel), random feature ridge regression, or neural
tangent regression. For X = RF, assume N ≥ n1+δ , and for X = NT, assume N d ≥ n1+δ . Under
the data model above (in particular, yi = φ(⟨w∗ , xi ⟩) + εi ), if n ≤ dℓ+1−δ then the following lower
bound holds in probability:

lim inf RX (f∗ ) ≥ bℓ . (6.1.2)


n,d→∞

Further unless φ is a polynomial, we have bℓ > 0 strictly for all ℓ.

This is somewhat disappointing. After all, the function (6.1.1) looks extremely simple, and
only has d free parameters (plus, eventually O(1) parameters to approximate the one-dimensional
function φ). One extreme case is given by φ = σ. Note that Corollary 6.1.1 does not change in this
case: linearized neural networks cannot learn a target function that coincides with a single neuron
from any polynomial number of samples.
It is natural to wonder whether learning ridge functions might be hard for some hidden reason.
It turns out that ridge functions can often be learnt efficiently, even if we limit ourselves to gradient-
based methods. For instance, consider using a one-neuron network f (x; w) := σ(⟨w, x⟩) in the case
1
In fact, the term ridge function refers in applied mathematics and statistics to the broader class of targets of the
form f∗ (x) = ψ(U T∗ x) where ψ : Rk → R and U T∗ U ∗ = I k [LS75]. Unfortunately, this terminology can be confusing
because we are applying ridge regression to learn ridge functions, and the two uses of “ridge” are not related!
2
Inverting follows from dominated convergence if, for instance |φ(t)| ≤ C exp(C|t|) for a constant C > 0 and all
t ∈ R.

48
φ = σ. We perform gradient descent with respect to the empirical risk:
n
X 2
R̂n (w) = yi − σ(⟨w, xi ⟩) . (6.1.3)
i=1

The corresponding excess test error is R(w) = E{(σ(⟨w∗ , x⟩) − σ(⟨w, x⟩))2 }. The following result
from [MBM18] implies that gradient descent succeeds in learning this target.

Proposition 6.1.2. Under the above data distribution (namely, xi ∼ Unif(Sd−1 ( d)) and yi =
σ(⟨w∗ , xi ⟩) + εi ) assume σ is bounded and three times differentiable with bounded derivatives. Fur-
ther assume σ ′ (t) > 0 for all t. Then there exists a constant C = Cσ,τ such that, if n ≥ C d log d,
the following happens:

1. The empirical risk R̂n (w) has a unique (local, hence global) minimizer ŵn ∈ Rd .

2. Gradient descent converges globally to ŵn .

3. The excess test error at this minimizer is bounded as


r
d log n
R(ŵn ) ≤ C . (6.1.4)
n

Considering —to be definite— the setting of this proposition, and learning using two layer
networks. We find ourselves in the following peculiar situation:

1. A one-neuron network can learn the target from n = O(d log d) samples, which roughly
matches the information theoretic lower bound.

2. As the number of neurons gets large, we learn from [JGH18] that a two-layer neural network
is well approximated by the corresponding NT model.

3. The NT model cannot learn the same target from any polynomial (in d) number of samples.

In other words, the learning ability of two-layer networks seems to deteriorate as the number
of neurons increases. Needless to say, this is a very counterintuitive behavior.
What is ‘going wrong’ ?
As we will see in the next section, point 2 above is what is going wrong. Wide neural networks
are well approximated by their linear counterpart only under a specific scaling of the network
weights. Other behavior (and other infinite-width limits) are obtained if different scalings are
chosen. Under these scalings, the obstructions and counterintuitive behaviors above disappear.

6.2 Non-lazy infinite-width networks (a.k.a. mean field)


Let us restart from scratch, and write once more the parametric form of a two-layer neural network,
which we first wrote down in Eq. (1.1.7):
N
X
fˆ(x; Θ) = αN ai σ(⟨wi , x⟩) , θ i = (ai , wi ) ∈ Rd+1 . (6.2.1)
i=1

49

Here we denoted collectively by Θ := (ai , wi ) : i ≤ N the network parameters. We mentioned
briefly in Chapter 1 that in certain regimes, such a network is well approximated by its first order
Taylor expansion around the initialization. Roughly speaking, this is the case when N is large, but
only if we scale αN in a suitable way, i.e. αN ≍ N −1/2 .
We want to revisit the infinite-width limit, under a different scaling of the model, namely
N
ˆ 1 X
f (x; Θ) = ai σ(⟨wi , x⟩) . (6.2.2)
N
i=1

This is sometimes referred to as the ‘mean field’ scaling, after [MMN18], which we will follow here.
It is useful to denote by θ i = (ai , wi ) ∈ Rd+1 the vector of parameters for neuron i, and define
σ∗ (x; (a, w)) := aσ(⟨w, x⟩). Consider SGD training with respect to square loss. We will fix a
stepsize η and index iterations by t ∈ {0, η, 2η, . . . }. The SGD iteration reads:

θ t+η
i = θ ti + η(yI(t) − fˆ(xI(t) ; Θt ))∇θ σ∗ (xI(t) ; θ ti ) , (6.2.3)

where I(t) is the index of the sample used at step t of SGD. We will have in mind two settings:
(i) Online SGD, whereby at each step a fresh sample from the population is used (in this case we are
optimizing the population error); (ii) Standard SGD with batch size 1, whereby I(t) ∼ Unif([n]).
When the stepsize η becomes small, SGD effectively averages over a large number of samples
in a fixed time interval [t, t + ∆], and it is natural to expect that the above dynamics is well
approximated by the following gradient flow
t
θ̇ i = Ey,x (y − fˆ(x; Θt ))∇θ σ∗ (x; θ ti )} .

(6.2.4)

Here Ey,x is the expectation over the population distribution for online SGD, or over the sample
for standard SGD. We can conveniently rewrite the above flow by introducing the functions

V (θ) = −Ey,x [yσ∗ (x; θ)] (6.2.5)


= −Ey,x [yaσ(⟨w, x⟩)],
U (θ 1 , θ 2 ) = Ex [σ∗ (x; θ 1 ) σ∗ (x; θ 2 )] (6.2.6)
= Ex [a1 σ(⟨w1 , x⟩) a2 σ(⟨w2 , x⟩)] .

We can then rewrite Eq. (6.2.4) as


N
t 1 X
θ̇ i = −∇θ V (θ ti ) − ∇U (θ ti , θ tj ) . (6.2.7)
N
j=1

This expression makes it clear that the right-hand side depends only on θ ti and on the empirical
distribution of the neurons
N
1 X
ρ̂N,t := δθti . (6.2.8)
N
i=1

Indeed, we can further rewrite Eq. (6.2.7)


Z
t
θ̇ i = −∇θ V (θ ti ) − ∇V (θ ti , θ ′ ) ρ̂N,t (dθ ′ ) . (6.2.9)

50
We can put on our physicist’s hat and interpret Eq. (6.2.9) as describing the evolution of N
particles, where the velocity of each particle is a function of the density at time t. Hence the density
must satisfy a continuity partial differential equation (PDE):

∂t ρ̂N,t = ∇θ · ρ̂N,t (θ)∇θ Ψ(θ; ρ̂N,t ) , (6.2.10)
where
Z
Ψ(θ; ρ) = V (θ) + U (θ, θ ′ ) ρ(dθ ′ ) . (6.2.11)

Let us point out that Eq. (6.2.10) is a completely equivalent description of the gradient flow (6.2.4),
and holds for any finite N , provided one is careful about defining gradients of delta functions3 and
similar technicalities.
The formulation of Eq. (6.2.10) has two interesting properties:
• The evolution takes place in the same space for different values of N , this is the space of
probability measures ρ̂t in d + 1 dimensions. While we initially introduced Eq. (6.2.10)
for probability measures that are sums of point masses, the same PDE makes sense more
generally.
• Equation (6.2.10) ‘factors out’ the invariance of Eq. (6.2.4) under permutations of the neurons
{1, . . . , N }.
Because of these reasons, it is very easy (at least formally) to take the N → ∞ limit in
Eq. (6.2.10). Assume that the initialization satisfies ρ̂N,0 ⇒ ρinit as N → ∞, where ⇒ denotes the
weak convergence of measures. For instance this is the case if
{(a0i , w0i )}i≤N ∼iid ρinit . (6.2.12)
Then (under suitable technical conditions) it can be proven that ρ̂N,t ⇒ ρt for any t, where ρt
satisfies

∂t ρt = ∇θ · ρt (θ)∇θ Ψ(θ; ρt ) , (6.2.13)
ρ0 = ρinit .
We also note that the potential Ψ(θ; ρ) can be interpreted as the first order derivative of the
population error with respect to a change of density of neuron at θ. Using physics notations:
δR(ρ)
Ψ(θ; ρ) = , (6.2.14)
δρ(θ)
where we defined
Z
1 n 2 o
R(ρ) := E f∗ (x) − σ∗ (x; θ) ρ(dθ) (6.2.15)
2
Z Z
1 1
2
= E(f∗ (x) ) + V (θ)ρ(dθ) + U (θ, θ ′ )ρ(dθ)ρ(dθ ′ ) .
2 2
In conclusion, by scaling the function f (x; Θ), we obtained a different limit description of
infinitely wide networks. Rather than a linearly parametrized model, which is fitted via min-norm
regression, we obtained a fully nonlinear model.
3
More precisely, Eq. (6.2.10) has to hold in weak sense, i.e. integrated against a suitable test function.

51
Remark 6.2.1. Nonlinear continuity equations of the form (6.2.13) have been studied for a long
time in mathematics, starting from the seminal work of Kac [Kac56] and McKean [MJ66]. Within
this literature, this is known as the McKean-Vlasov equation. We refer to Appendix B for further
pointers to this line of work.

6.3 Learning ridge functions in mean field: basics


Summarizing, the two-layer neural network (6.2.1) behaves in different ways depending on how we
scale the coefficient αN . In particular:

• If αN ≍ 1/ N , for large enough N , the neural network is well approximated by the neural
tangent model. In particular, for n ≤ d1+δ , the network only learns the projection of the
target function onto linear functions.

• RIf αN ≍ 1/ N , the SGD trained network converges as N → ∞ to the network f (x; ρt ) :=
a σ(⟨w, x⟩) ρt (da, dw), where ρt solves the PDE (6.2.13).
What is the behavior of the PDE (6.2.13) under the data distribution introduced in Section 6.1,
namely xi ∼ N(0, I d ), yi = φ(⟨w∗ , xi ⟩)+εi ? A fully rigorous treatment of this question is somewhat
intricate and goes beyond the scope of this overview. We will limit ourselves to some simple
considerations, and overview recent progress on this topic in Appendix B. We will assume that the
conditions of Proposition 6.1.2 hold.
Consider an uninformative initialization. In the present case, this means the first layer weights
wi have spherically symmetric distribution. For the sake of concreteness, we can assume them to
be Gaussian and take

ρinit (da, dw) = PA ⊗ N(0, γ 2 I d /d) , (6.3.1)

where the distribution PA of the second-layer weights can be arbitrary. (The scaling of the covari-
ance above with d is chosen so that ∥wi ∥ = Θ(1) for large d.)
Now, the data distribution is invariant under rotations that leave w∗ unchanged. As a con-
sequence, it is not hard to show that the evolution (6.2.13) is equivariant under the same group.
Namely, let R ∈ Rd×d be a rotation such that Rw∗ = w∗ . For a probability distribution ρ(da, dw)
on R × Rd , we let R# ρ be the push forward under this rotation4 (acting trivially on the first factor
R). Then if ρ0 and ρ′0 = R# ρ0 are two initializations for the PDE, if ρt , ρ′t are the corresponding
solutions to Eq. (6.2.13), then we necessarily have ρ′t = R# ρt .
Since the initialization (6.3.1) is invariant under such rotations, ρt will be invariant for all t.
In other words, under ρt , (a, w) is uniformly random, given (a, ⟨w∗ , w⟩, ∥P⊥ w∥2 ), where P⊥ is the
projection orthogonal to w∗ . We can therefore write a PDE for the joint distribution of these three
quantities which we denote by ρt (da, ds, dr). Writing z = (a, s, r), ρt is obtained by solving the
PDE
1
∂t ρt = ∂a (ρt ∂a Ψ(z; ρt )) + ∂s (ρt ∂s Ψ(z; ρt )) + ∂r (rρt ∂r Ψ(z; ρt )) . (6.3.2)
r
The initialization (6.3.1) translates into

ρinit = PA ⊗ N(0, γ 2 /d) ⊗ Qd−1,γ , (6.3.3)


4
R# ρ is the law of (a, Rw) when (a, w) ∼ ρ.

52
Figure 6.1: Conjectured behavior of the prediction error as a function of width for different scalings of the
network.

where
√ Qk,γ is the law of the square root of a chi-squared with k degrees of freedom, rescaled by
γ/ k + 1.
Let R(ρt ) be the prediction risk of Eq. (6.2.15), written in terms of ρt (using the fact that w
is uniformly random conditional on a, s, r). It follows from the general theory that t 7→ R(ρt ) is
non-decreasing. If it can be shown that

lim R(ρt ) ≤ ε0 , (6.3.4)


t→∞

then this directly implies that, for any ε > 0, SGD achieves test error upper bounded (with high
probability) by ε0 + ε, from O(d) samples. Indeed, the PDE (6.2.13) can be shown to approximate
SGD provided the stepsize satisfies η ≤ 1/(C0 d), and therefore a time of order one translates into a
number of iterations (or samples) of order d (assuming online SGD).
Indeed we expect Eq. (6.3.4) to hold with ε0 = 0 in many cases of interest, e.g. under the
assumptions on σ = φ in Proposition 6.1.2. A proof of this fact would be too long for these notes.
Partial evidence is provided by the following facts:

1. The reduced PDE (6.3.2) is 3-dimensional and can be solved numerically, confirming Eq. (6.3.4) (a crude particle-level alternative is sketched after this list).

2. For PA = δ1 and γ small, if we modify SGD so that the second-layer weights are not updated, the
resulting learning dynamics is similar to that of a single neuron, covered by Proposition
6.1.2, and therefore an approximation argument can be employed to bound the risk.
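As complementary, purely illustrative evidence, one can bypass the PDE and simulate its particle discretization directly: online SGD on a wide two-layer network with mean-field scaling, trained on the ridge-function data of Section 6.1. The sketch below makes several arbitrary choices (σ = φ = tanh, PA uniform on {±1}, noiseless labels, ad hoc stepsize and width); the printed test risk should decrease towards zero after a number of samples of order d.

import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 1000                      # input dimension, network width
gamma, eta, T = 1.0, 0.025, 40_000   # init scale, stepsize of order 1/d, number of online samples

w_star = np.zeros(d); w_star[0] = 1.0
phi = np.tanh                        # target link; here sigma = phi

# Mean-field network f(x) = (1/N) sum_i a_i tanh(<w_i, x>), initialized as in Eq. (6.3.1)
a = rng.choice([-1.0, 1.0], size=N)
W = rng.normal(size=(N, d)) * gamma / np.sqrt(d)

def predict(X):
    return np.tanh(X @ W.T) @ a / N

X_test = rng.normal(size=(2000, d))
y_test = phi(X_test @ w_star)

for t in range(T):
    x = rng.normal(size=d)           # fresh sample (online SGD), noiseless label
    y = phi(x @ w_star)
    h = np.tanh(W @ x)
    err = (a @ h) / N - y
    # SGD step on N * (square loss): the 1/N factor in df/dtheta_i cancels, so each
    # particle moves by O(eta) per step, matching the mean-field time scale.
    grad_a = err * h
    grad_W = err * np.outer(a * (1.0 - h**2), x)
    a = a - eta * grad_a
    W = W - eta * grad_W
    if (t + 1) % 10_000 == 0:
        risk = np.mean((predict(X_test) - y_test) ** 2)
        print(f"samples = {t + 1:6d}   test risk = {risk:.4f}")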

We conclude that, under the setting of Proposition 6.1.2, very wide neural networks with mean-
field scaling learn ridge functions from O(d) samples.
While we do not have a precise characterization of the risk for intermediate values of N , we
sketch in Figure 6.1 the (partly conjectural) behavior of the prediction errors for different scalings
of the network, when learning a ridge function with φ = σ.

Appendix A

Summary of notations

For a positive integer n, we denote by [n] the set {1, 2, . . . , n}. For vectors u, v ∈ Rd , we denote
⟨u, v⟩ = u1 v1 + . . . + ud vd their scalar product, and ∥u∥2 = ⟨u, u⟩^{1/2} the ℓ2 norm. We denote by
Sd−1 (r) = {u ∈ Rd : ∥u∥2 = r} the sphere of radius r in d dimensions (we will simply write
Sd−1 := Sd−1 (1) for the unit sphere).
Given a matrix A ∈ Rn×m , we denote by ∥A∥op = max_{∥u∥2 =1} ∥Au∥2 its operator norm and by
∥A∥F = (Σ_{i,j} A_{ij}^2 )^{1/2} its Frobenius norm. If A ∈ Rn×n is a square matrix, the trace of A is
denoted by Tr(A) = Σ_{i∈[n]} A_{ii} . Further, given a positive semidefinite matrix B ∈ Rd×d and a
vector u ∈ Rd , we denote by ∥u∥B = ∥B^{1/2} u∥2 = ⟨u, Bu⟩^{1/2} the weighted ℓ2 norm.
Given a probability space (X , ν), we denote by L2 (X ) := L2 (X , ν) the space of square integrable
functions on (X , ν), and by ⟨f, g⟩_{L2 (X )} = Ex∼ν {f (x)g(x)} and ∥f ∥_{L2 (X )} = ⟨f, f ⟩_{L2 (X )}^{1/2} respectively
the scalar product and norm on L2 (X ) (we will sometimes write ⟨·, ·⟩_{L2} when (X , ν) is clear from
context). The space L2 (R, γ) plays an important role throughout these notes, where γ(dx) =
e^{−x^2/2} dx/√(2π) is the standard Gaussian measure. We will often prefer the notation G ∼ N(0, 1)
to denote the standard normal distribution, and write L2 (N(0, 1)) := L2 (R, γ). The set of Hermite
polynomials {Hek }k≥0 forms an orthogonal basis of L2 (R, γ), where Hek has degree k. We follow
the classical normalization:

⟨Hek , Hej ⟩_{L2 (N(0,1))} = EG∼N(0,1) {Hek (G) Hej (G)} = k! 1_{j=k} .

In particular, for any function g ∈ L2 (N(0, 1)), we have the decomposition


g(x) = Σ_{k=0}^∞ (µk (g)/k!) Hek (x) ,       µk (g) := EG∼N(0,1) {g(G) Hek (G)} ,       ∥g∥_{L2}^2 = Σ_{k=0}^∞ (µk (g))^2 /k! ,

where µk (g) is sometimes referred to as the k-th Hermite coefficient of g.
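As a small illustration of this normalization, the probabilists' Hermite polynomials Hek are available in NumPy as numpy.polynomial.hermite_e, and the Hermite coefficients of a given g can be computed by Gauss quadrature (the choice g = ReLU below is arbitrary):

import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

# hermegauss integrates against exp(-x^2/2) dx; divide by sqrt(2*pi) to get E_{G ~ N(0,1)}
x, w = He.hermegauss(80)
gauss_E = lambda vals: (w @ vals) / sqrt(2 * pi)

def hermite(k, t):
    # evaluate the probabilists' Hermite polynomial He_k at t
    return He.hermeval(t, [0.0] * k + [1.0])

# Orthogonality: <He_k, He_j>_{L2(N(0,1))} = k! 1_{j=k}
print(gauss_E(hermite(3, x) * hermite(3, x)), factorial(3))   # ~ 6 and 6
print(gauss_E(hermite(3, x) * hermite(2, x)))                 # ~ 0

# Hermite coefficients mu_k(g) = E[g(G) He_k(G)], here for g = ReLU
g = lambda t: np.maximum(t, 0.0)
print([gauss_E(g(x) * hermite(k, x)) for k in range(4)])
# ~ [1/sqrt(2*pi), 1/2, 1/sqrt(2*pi), 0]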


We use Od ( · ) (resp. od ( · )) for the standard big-O (resp. little-o) relations, where the subscript
d emphasizes the asymptotic variable. Furthermore, we write f = Ωd (g) if g(d) = Od (f (d)), and
f = ωd (g) if g(d) = od (f (d)). Finally, f = Θd (g) if we have both f = Od (g) and f = Ωd (g).
We use Od,P ( · ) (resp. od,P ( · )) for the big-O (resp. little-o) in probability relations. Namely, for
h1 (d) and h2 (d) two sequences of random variables, h1 (d) = Od,P (h2 (d)) if for any ε > 0, there
exists Cε > 0 and dε ∈ Z>0 , such that

P(|h1 (d)/h2 (d)| > Cε ) ≤ ε, ∀d ≥ dε ,

and respectively: h1 (d) = od,P (h2 (d)), if h1 (d)/h2 (d) converges to 0 in probability. Similarly, we will
denote h1 (d) = Ωd,P (h2 (d)) if h2 (d) = Od,P (h1 (d)), and h1 (d) = ωd,P (h2 (d)) if h2 (d) = od,P (h1 (d)).
Finally, h1 (d) = Θd,P (h2 (d)) if we have both h1 (d) = Od,P (h2 (d)) and h1 (d) = Ωd,P (h2 (d)).

Appendix B

Literature survey

Chapter 1 (The linear regime)


The connection between neural networks trained by gradient descent and kernel methods was
first1 elucidated in [JGH18]. They showed that, under a specific scaling of weights at initialization
and learning rate, neural networks trained by gradient flow converge to a kernel regression solution
in the infinite-width limit (N → ∞). The specific kernel of this solution was termed the neural
tangent kernel (NTK). A striking implication of this finding is that gradient flow converges to zero
training error —a global minimizer— despite the non-convexity of the full optimization landscape.
Following this intuition, [LL18, DZPS18] proved —under the same scaling of parameters— global
convergence of gradient descent for two-layer neural networks with ReLU activation and sufficiently
large but finite number of neurons. Subsequent studies extended the proof of global convergence
to deep neural networks and more general architectures [DLL+ 19, AZLS19b, AZLS19a, LXS+ 19,
ZCZG20]. However, these results require stringent overparametrization conditions on the network
width. For example, [DZPS18] requires two-layer neural networks with a number of neurons N of at
least Ω̃(n^6 /λ_0^4 ), where n is the training sample size and λ0 is the minimum eigenvalue of the empirical
neural tangent kernel matrix. With the goal of establishing global convergence for networks with
realistic widths, a large body of work has steadily pushed down the overparametrization condition
on N [JT19, KH19, SY19, ZG19, OS19, OS20, ZCZG20]. In particular, [OS20] showed that under
mild assumptions on the data distribution, quadratic overparametrization N d = Ω(n^2 ) is sufficient
for global convergence of two-layer neural networks. This condition can be further improved to near
optimal overparametrization N d = Ω(n log(n)) by modifying the parameter scaling and choosing
a suitable initialization, as shown in [BMR21]. In addition to global convergence of the training
error, a number of studies have considered the generalization properties of neural networks in this
regime and proved that low test error can be achieved under various data distribution assumptions
[LL18, ADH+ 19a, CFW+ 21, CG19, NCS19, JT19].
As shown in [COB19], the connection to kernel methods is not restricted to neural networks,
but extends to more general non-linear models trained by gradient descent. Indeed, it originates
from a particular choice in the scaling of the model parameters. With this specific scaling, a small
change in the weights produces a large change in the output of the non-linear model but only a
small change in its Hessian (see also discussion in [LZB20]). As a consequence, neural networks
1. Several connections between neural networks and kernel methods have been discussed earlier in the literature, e.g.
see [Nea95, CS09, DFS16, LBN+ 18, NXB+ 18] and references therein. However, in these cases, the correspondence
with kernel models only holds for random neural networks, i.e. at initialization of the SGD dynamics, while [JGH18]
holds, under certain conditions, for the entire training trajectory.

in this optimization regime have their weights barely moving, while they converge to 0 training
loss. This led [COB19] to dub this regime the lazy training regime. Notably, the network can be
approximated throughout training by its linearization around its random weights at initialization,
which is the starting point for these notes.
For overparametrized neural networks, i.e. underconstrained optimization problems, there exist
many global minimizers with zero training loss. These interpolators achieve very different gener-
alization errors. Despite being trained with no explicit regularization that would promote ‘good’
models, the solutions found in practice generalize well to test data [ZBH+ 21]. A popular explana-
tion for this performance postulates an ‘implicit regularization’ from the training algorithm itself
[NTS14, NBMS17]. In words, generalization is implicitly controlled by the dynamics of the opti-
mization algorithm, which selects a good minimizer, and not by adding an explicit regularization
to the risk. For example, in the linear regime, gradient descent selects the minimizer closest from
the initialization in weighted ℓ2 distance (see Remark 1.4.1). Major research activity has been
devoted in recent years to characterizing implicit regularization for various algorithms, including
gradient descent in matrix factorization [GWB+ 17, LLL20], mirror descent [GLSS18a], gradient
descent on separable data [SHN+ 18, JT18b], linear neural networks [GLSS18b, JT18a, YKM20]
and neural networks with homogeneous activation functions [LL19, JT20]. Despite these efforts,
the precise complexity measure that is implicitly controlled by SGD in neural networks remains
poorly understood, except in restricted settings.

Chapter 2 (Linear regression under feature concentration)


In the case of Gaussian features, this model is also known as the ‘Gaussian design model’
[DW18]. The asymptotic risk of ridge regression with random features was computed by [Dic16]
for isotropic Gaussian features xi ∼ N(0, I p ) in the proportional asymptotics p, n → ∞ with
p/n → γ ∈ (0, ∞). These results were generalized by [DW18] to anisotropic features xi = Σ1/2 z i ,
where z i has independent entries with bounded 12th moment. [HMRT22] later derived similar
results under weaker assumptions on (β ∗ , Σ).
The fact that ridge regression with isotropic random features displayed a ‘double descent’ pat-
tern was pointed out in [AS17]. However, reproducing the full phenomenology observed empirically,
with the minimum risk achieved at large overparametrization, requires coefficient vectors β ∗ aligned
with Σ. [RMR21, WX20] computed the asymptotic risk in such settings. [HMRT22] derived similar
results under weaker assumptions on (β ∗ , Σ). The results in [HMRT22] hold non-asymptotically,
with explicit error bounds, using the anisotropic local law proved in [KY17].
The works listed above only considered the proportional parametrization scaling, i.e., C −1 ≤
p/n ≤ C, which prohibits features coming from highly overparametrized (p ≫ n) or kernel (p = ∞)
models. A general scaling (including p = ∞) was recently studied in [CM22]. Using a novel
martingale argument to prove deterministic equivalents, they derived non-asymptotic predictions
for the test error, with nearly optimal rates for the error bounds on the bias and variance terms,
namely Õ(n−1/2 ) and Õ(n−1 ) which match the fluctuations of respectively the local law and average
law for the resolvent.
Despite the simplicity of this first model, it is connected to more complex models via universality.
For example, it was shown in some settings that kernel regression [Mis22] and random feature
regression [MM19] have the same asymptotic test errors as some equivalent Gaussian models, where
the non-linear feature maps are replaced by Gaussian vectors with matching first two moments.
The recent interest in interpolators was prompted by the experimental results in [ZBH+ 21,

BMM18] which showed that deep neural networks and kernel methods can generalize well even
when they interpolate noisy data. This surprising phenomenon —often referred to as benign over-
fitting following [BLLT20]— is at odds with the classical picture in statistics, where we expect
interpolating noisy training data to lead to unreasonable models with large test errors. Recent
work has instead proved that several standard learning models can indeed interpolate benignly
under certain conditions, including the Nadaraya-Watson estimator [BRT19], kernel ridgeless re-
gression [LR20, GMMM21] and max-margin linear classifiers [MRSY19, Sha22]. As a testbed to
understand interpolation learning, much focus has been devoted to studying interpolation in linear
regression with (sub-)Gaussian features [HMRT22, BLLT20, TB20, MVSS20, KZSS21]. In partic-
ular, [BLLT20] showed sufficient conditions for consistency of the minimum ℓ2 -norm interpolating
solution in terms of an effective rank, which were later refined in [CM22]. Further, a number of studies
have argued that benign overfitting in (kernel) ridge regression is a high dimensional phenomenon
[RZ19, BBP22].

Chapter 3 (Kernel ridge regression)


The test error of kernel ridge regression (KRR) was investigated in a number of papers in the
past [BBM05, CDV07, Wai19]. In particular, [CDV07] showed that KRR achieves minimax optimal
rates over classes of functions under a certain capacity condition on the kernel operator and a source
condition on the target function. However these earlier results have limitations. First, they require
a strictly positive ridge regularization and therefore do not apply to interpolators. Second, they
often only characterize tightly the decay rate of the error, i.e. as n → ∞ for fixed dimension d. In
these notes, we are instead interested in deriving the precise test error of KRR in high dimension
where both n and d grow simultaneously. Finally, in contrast to earlier work, the results discussed
in these chapters hold for a given target function f∗ (instead of worst case over a class of functions).
The study of kernel models in high dimension was initiated by the seminal work [EK10]. El
Karoui analysed empirical kernel matrices of the form K = (h(⟨xi , xj ⟩/d))i,j≤n ∈ Rn×n in the
proportional asymptotics n ∝ d, with xi = Σ1/2 z i where z i ∈ Rd has independent entries with
bounded 4 + ε moment. [EK10] showed that K is well approximated by its linearization (here in
the isotropic case Σ = I d )
K ≈ (h(0) + h′′ (0)/(2d)) 11^T + h′ (0) G + (h(1) − h(0) − h′ (0)) I_n ,        (B.0.1)
where G = (⟨xi , xj ⟩/d)i,j≤n is the Gram matrix. This result was later used to bound the asymptotic
prediction error of KRR in the proportional regime [LR20, LLS21, BMR21]. In particular, KRR
can learn at most a linear approximation to the target function in this regime.
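A short numerical check of Eq. (B.0.1) (a sketch only; the choice h = tanh, the isotropic Gaussian covariates and the sizes n, d are arbitrary) compares the kernel matrix with its linearization in the proportional regime:

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 1000                     # proportional regime, n of order d
X = rng.normal(size=(n, d))          # isotropic covariates, Sigma = I_d

h = np.tanh                          # kernel function h, applied entrywise
h0, h1, h2 = 0.0, 1.0, 0.0           # h(0), h'(0), h''(0) for tanh

G = X @ X.T / d                      # Gram matrix (<x_i, x_j>/d)_{i,j}
K = h(G)                             # kernel matrix K_ij = h(<x_i, x_j>/d)

ones = np.ones((n, 1))
K_lin = ((h0 + h2 / (2 * d)) * (ones @ ones.T)
         + h1 * G
         + (h(1.0) - h0 - h1) * np.eye(n))

# operator-norm error of the linearization, small compared to ||K||_op as n, d grow
print(np.linalg.norm(K - K_lin, 2), np.linalg.norm(K, 2))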
In order to study a more realistic scenario, where n ≫ d with n, d large, several works have con-
sidered a more general polynomial asymptotic scaling, with n ≍ dκ for fixed κ ∈ R>0 as d, n → ∞
[GMMM21, MMM22, LRZ20, GMMM20, MMM21, MM21, Xia22]. In the case of covariates uni-
formly distributed on the d-dimensional sphere, [GMMM21] generalized Eq. (B.0.1) and showed
that the empirical kernel matrix can be approximated by its degree-⌊κ⌋ polynomial approximation.
Using this decomposition, [GMMM21] showed that for κ ̸∈ N, KRR fits the best degree-⌊κ⌋ poly-
nomial approximation (see Section 3.4 in Chapter 3). [MMM22] extended these results to a more
general KRR setting under abstract conditions on the kernel operator, namely hypercontractivity
of the top eigenfunctions and a spectral gap property (the number of eigenvalues λi such that
d^δ /n ≥ λi ≥ d^{−δ} /n is smaller than n^{1−δ} for some δ > 0). In this setting, [MMM22] showed that
KRR effectively acts as a shrinkage operator with some effective regularization λeff > λ. Specifi-
cally, denoting (ψd,j )j≥1 the eigenfunctions and (λd,j )j≥1 the eigenvalues in nonincreasing order of

the kernel operator Kd , the KRR solution with target function f∗ and regularization λ is given by
f∗ (x) = Σ_{j=1}^∞ cj ψd,j (x)    ⇒    fˆλ = Σ_{j=1}^∞ [λd,j /(λd,j + λeff /n)] cj ψd,j (x) + ∆ ,        (B.0.2)

where ∥∆∥L2 = od,P (1). Hence, KRR essentially fits the projection of the target function on the
top O(n) eigenfunctions with λd,j ≫ λeff /n, and none of the components of f∗ with λd,j ≪ λeff /n.
In the setting of inner product kernels on the sphere [GMMM21], the spectral decay property is
satisfied for n ≍ dκ , κ ̸∈ N. In this case, λeff = Θ(1) and indeed all spherical harmonics of degree
less than or equal to ⌊κ⌋ have λd,j ≫ λeff /n, while higher-degree spherical harmonics have λd,j ≪ λeff /n,
and we recover that KRR fits the projection of f∗ onto degree-⌊κ⌋ polynomials.
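The resulting sharp transition in the shrinkage factors λd,j /(λd,j + λeff /n) of Eq. (B.0.2) can be made explicit with a stylized computation (a sketch only, taking λd,k ≍ d^{−k} for the degree-k eigenvalues and λeff = 1):

d, kappa = 500, 2.5
n = int(d ** kappa)
lam_eff = 1.0                        # effective regularization, Theta(1) in this regime

for k in range(6):
    lam_k = float(d) ** (-k)         # stylized degree-k eigenvalue of the kernel operator
    shrink = lam_k / (lam_k + lam_eff / n)
    print(k, round(shrink, 4))
# degrees k <= floor(kappa) are retained (shrinkage close to 1),
# higher degrees are essentially filtered out (shrinkage close to 0)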
If the spectral decay property is not satisfied, the estimator (B.0.2) needs to be rescaled by
a constant coming from random matrix theory effects, as shown in Chapter 2. The case n ≍ dℓ ,
ℓ ∈ N for data uniformly distributed on the sphere was studied recently in three concurrent papers
[Mis22, HL22a, XHM+ 22]. They showed that the contribution of degree-ℓ spherical harmonics to
the empirical kernel matrix behaves as a Wishart matrix. In particular, as n approaches dℓ /ℓ!, the
condition number of the Wishart matrix diverges and a peak can appear in the risk curve.
Finally, note that asymptotics for the KRR test error were also heuristically derived either
by conjecturing an equivalence with Gaussian feature models [JSS+ 20] or using statistical physics
heuristics [CBP21, CLKZ21, CLKZ22].

Chapter 4 (Random Feature model)


The random feature (RF) model was initially introduced as a finite-rank randomized approx-
imation to kernel methods [BBV06, RR07]. The connection between neural networks and RF
models was originally pointed out by [Nea95, Wil96] and has recently attracted renewed attention
through the NTK and Gaussian process descriptions of wide neural networks [NXB+ 18, DMHR+ 18,
LBN+ 18]. [RR07] showed that the empirical random feature kernel converges pointwise to the
asymptotic kernel (N = ∞). Note that this pointwise convergence does not provide any con-
trol on the performance of RF when both N, n are allowed to grow together. Subsequent works
[RR08, Bac17b, RR17] derived bounds on the approximation and generalization errors of RF models.
[RR08] proved a minimax upper bound O(1/√n + 1/√N ) on the generalization error, but their
results are limited to Lipschitz losses and require a stringent condition on ∥a∥∞ . More recently,
[RR17] studied the case of the square loss and proved that, for f∗ in the RKHS, N = C √n log n
random features are sufficient to achieve the same error rate O(n^{−1/2} ) as the limiting KRR.
is in contrast to the requirement N = O(n1+δ ) described in these notes, the core difference lying
in the assumption on the target function f∗ (see discussion in [MMM22]). Furthermore, results
in [RR17] required positive ridge regularization and only characterized the minimax error rate as
n → ∞ for a fixed RKHS (fixed d). In these notes, we will consider instead studying the RF model
in high dimension when N, n, d all grow together.
A number of works [MM19, LCM20, AP20, ALP22, MMM22] have studied the asymptotic risk
of ridge regression with random features in high dimension. In particular, [MM19] derives the
complete analytical predictions for the test error in the proportional regime n ≍ d and N ≍ d,
in the case of covariates and weights both uniformly distributed on the d-dimensional sphere.
These asymptotics capture in detail the double descent phenomenon in this model. In particular,
[MM19] provides precise conditions for the highly overparametrized (N/n → ∞) and interpolating
(λ → 0+) solution to be optimal among random feature models. The polynomial scaling for

random feature models was first considered in [GMMM21] which focused on the approximation
error (n = ∞) with data and weights uniformly distributed on the sphere. They show that RF
models with d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} random features can only approximate the projection of the target
function on degree-ℓ polynomials. The generalization error was studied in a subsequent work
[MMM22]. They consider abstract conditions on the eigendecomposition of the activation function,
namely hypercontractivity of the top eigenfunctions, an eigenvalue gap (i.e. there exists m ≤ d^{−δ} s,
where s = min(N, n), such that λd,m ≥ d^δ /s and λd,m+1 ≤ d^{−δ} /s for some constant δ > 0) and
max(N/n, n/N ) ≥ d^δ . Under these assumptions, [MMM22] shows that RF ridge regression fits
exactly the projection of the target function on the top m ≤ min(n, N ) eigenfunctions. Applying
these results to covariates and weights uniformly distributed on the sphere, this shows that for
d^{ℓ1+δ} ≤ N ≤ d^{ℓ1+1−δ} , d^{ℓ2+δ} ≤ n ≤ d^{ℓ2+1−δ} and max(N/n, n/N ) ≥ d^δ , RF ridge regression fits the
degree-min(ℓ1 , ℓ2 ) polynomial approximation to f∗ .
Finally, in certain cases (for instance, ridge regression), the non-linear random feature model is
connected through universality to a simpler linear Gaussian covariate model [GLK+ 20, GLR+ 22,
HL22b, MS22].

Chapter 5 (Neural Tangent model)


In the linear regime, the optimization and generalization properties of neural networks depend
crucially on the empirical neural tangent kernel matrix and in particular, its smallest eigenvalue. For
example, [DZPS18] showed that for two-layer neural networks with width N ≥ C n^6 /λmin (K n )^4 ,
where K n is the infinite-width empirical NTK matrix (N = ∞), the empirical kernel is well
conditioned throughout training and neural networks converge to zero training error. A number
of works have studied the empirical kernel matrices arising from the neural tangent model [FW20,
LZB20, OS20, ZCZG20].
The approximation error of the neural tangent model associated to two-layer neural networks
was studied in [GMMM21] for covariates and first layer weights uniformly distributed on the d-
dimensional sphere. They prove that for d^{ℓ−1+δ} ≤ N ≤ d^{ℓ−δ} , the neural tangent model exactly fits
the best degree-ℓ polynomial approximation to the target function. Namely, the neural tangent
model has the same approximation power as the random feature model, provided we match the
number of parameters p = N d for NT model and p = N for RF model. Note that this equivalence
does not hold when data is anisotropic, e.g. see [GMMM20]. The generalization error of the
neural tangent model was studied in a subsequent paper [MZ22]. They first show that as long as
N d/(log N d)^C ≥ n, the empirical neural tangent kernel has eigenvalues bounded away from zero,
and the neural tangent model can exactly interpolate arbitrary labels. In the same regime, [MZ22]
prove that ridge regression with the neural tangent model is well approximated by kernel ridge
regression with the infinite width kernel.

Chapter 6 (Beyond the linear regime)


Several studies have empirically investigated the relevance of the linear regime to describe
neural networks used in practice. They show that, in some settings, the neural tangent model
captures the behavior of SGD-trained neural networks, at least at the beginning of the dynamics
[LXS+ 19, GSJW20, WGL+ 20]. Furthermore, it was shown that in some settings NTK can achieve
superior performance compared to standard kernels [ADH+ 19b, LWY+ 19, SFG+ 20] and even state-
of-the-art performance on some datasets [ADL+ 19]. However, neural networks in the linear regime
and kernel methods typically fall short in comparison to state-of-the-art neural networks. The

scenario in which weights move significantly away from their random initialization is referred to as
the rich or feature learning regime.
A major theoretical achievement of the nineties [DHM89, Hor91, Bar93, Pin99] was to prove the
approximation theoretic advantage of non-linear neural networks compared to fixed basis methods.
In particular, Barron showed in his celebrated work [Bar93] that two-layer neural networks can
approximate, with few neurons, functions with fast decay of their high frequency components,
while linear methods need a number of parameters exponential in the dimension to approximate
the same class of functions (in worst case). More recent works have established tighter lower
bounds —that are cursed by dimensionality— on the performance of kernel and random feature
models [Bac17a, YS19, VW19, GMMM21]. For example, [YS19] proves that a super-polynomial in
d number of random features is required to approximate a single neuron.
While approximation theory deals with ideal representations —which might not be tractable
to find in practice—, a recent line of work have sought to display theoretical settings where neu-
ral networks trained by gradient descent provably outperform neural tangent and kernel models
[Bac17a, WGL+ 19, AZL19, AZL20, GMMM19, GMMM20, FDZ19, CB20, DM20]. In particular,
much attention has been devoted to the setting of learning ridge functions, i.e. functions that only
depend on a small number of (unknown) relevant directions, f∗ (x) = ψ(U_∗^T x) with U_∗^T U_∗ = I_k
[CBL+ 20, RGKZ21, AAM22, DLS22, MHPG+ 22, BBSS22, AAM23, BMZ23]. These functions are
also known in the literature as single-index models in the case k = 1 and multiple-index models for
k > 1. The reason for this interest is that learning ridge functions offer a simple setting where a
clear separation exists between neural networks trained non-linearly and kernel methods. Indeed,
linear methods are oblivious to latent linear structures [Bac17a, GMMM21, AAM22], while gradient
descent has the possibility to align the network weights with the sparse support. The picture that
emerges from these works is as follows. The complexity of learning single index models by gradient
descent is driven by the information exponent [AGJ21], which measures the strength of the corre-
lation between a single neuron and the ridge function at initialization. This notion of information
exponent was generalized to multi-index models via the ‘leap complexity’ [AAM22, AAM23].
Hence, while the lazy regime offers a setting where both the optimization and generalization
properties of neural networks can be understood precisely, it does not capture the full power of deep
learning. A number of approaches have been suggested to go beyond the linear regime. Several
groups have proposed higher-order or finite-width corrections to the limiting neural tangent model
[HY20, BL19, HN19, DL20]. Another approach examines other infinite-width limits for gradient-
trained neural networks. A systematic analysis of the different infinite-widths limits was conducted
in [YH20], which characterized all non-trivial limits following an ‘abc-parametrization’ of the scaling
of learning rate, model parameters and initialization. In particular, they put forward a new maximal
update parametrization (µP ) which maximizes the change in the network weights after one SGD
step among all infinite-width limits.
A popular alternative to the linear regime corresponds to two-layer neural networks trained by
stochastic gradient descent in the mean-field regime [CB18, MMN18, RVE18, SS20, MMM19]. This
limit coincides with the µP limit for two-layer networks. Unlike the neural tangent approach, the
evolution of network weights is now non-linear and described in terms of a Wasserstein gradient
flow over the neurons’ weight distribution. However, analyzing the training dynamics requires
tracking the evolution of a distribution and remains challenging, except in highly-symmetric settings
where the PDE is effectively low-dimensional [MMN18, AAM23, BMZ23]. Finally, several papers
have proposed extensions of the mean-field limit to multilayer neural networks [NP20, LML+ 20].

Bibliography

[AAM22] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz, The merged-staircase
property: a necessary and nearly sufficient condition for sgd learning of sparse func-
tions on two-layer neural networks, Conference on Learning Theory, PMLR, 2022,
pp. 4782–4887.

[AAM23] Emmanuel Abbe, Enric Boix Adserà, and Theodor Misiakiewicz, Sgd learning on
neural networks: leap complexity and saddle-to-saddle dynamics, The Thirty Sixth
Annual Conference on Learning Theory, PMLR, 2023, pp. 2552–2623.

[ADH+ 19a] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, Fine-grained
analysis of optimization and generalization for overparameterized two-layer neural
networks, International Conference on Machine Learning, PMLR, 2019, pp. 322–332.

[ADH+ 19b] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong
Wang, On exact computation with an infinitely wide neural net, Advances in Neural
Information Processing Systems 32 (2019).

[ADL+ 19] Sanjeev Arora, Simon S Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and
Dingli Yu, Harnessing the power of infinitely wide deep nets on small-data tasks,
International Conference on Learning Representations, 2019.

[AGJ21] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath, Online stochastic gradient
descent on non-convex losses from high-dimensional inference, The Journal of Machine
Learning Research 22 (2021), no. 1, 4788–4838.

[ALP22] Ben Adlam, Jake A Levinson, and Jeffrey Pennington, A random matrix perspective on
mixtures of nonlinearities in high dimensions, International Conference on Artificial
Intelligence and Statistics, PMLR, 2022, pp. 3434–3457.

[AP20] Ben Adlam and Jeffrey Pennington, The neural tangent kernel in high dimensions:
Triple descent and a multi-scale theory of generalization, International Conference on
Machine Learning, PMLR, 2020, pp. 74–84.

[AS17] Madhu S Advani and Andrew M Saxe, High-dimensional dynamics of generalization


error in neural networks, arXiv preprint arXiv:1710.03667 (2017).

[AZL19] Zeyuan Allen-Zhu and Yuanzhi Li, What can resnet learn efficiently, going beyond
kernels?, Proceedings of the 33rd International Conference on Neural Information
Processing Systems, 2019, pp. 9017–9028.

[AZL20] , Backward feature correction: How deep learning performs deep learning,
arXiv:2001.04413 (2020).

[AZLS19a] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning
via over-parameterization, International Conference on Machine Learning, PMLR,
2019, pp. 242–252.

[AZLS19b] , On the convergence rate of training recurrent neural networks, Advances in


Neural Information Processing Systems 32 (2019), 6676–6688.

[Bac17a] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The
Journal of Machine Learning Research 18 (2017), no. 1, 629–681.

[Bac17b] , On the equivalence between kernel quadrature rules and random feature ex-
pansions, The Journal of Machine Learning Research 18 (2017), no. 1, 714–751.

[Bar93] Andrew R Barron, Universal approximation bounds for superpositions of a sigmoidal


function, IEEE Transactions on Information theory 39 (1993), no. 3, 930–945.

[BBM05] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson, Local Rademacher complexities, The Annals of Statistics 33 (2005), no. 4, 1497–1537.

[BBP22] Daniel Beaglehole, Mikhail Belkin, and Parthe Pandit, Kernel ridgeless regression is
inconsistent for low dimensions, arXiv preprint arXiv:2205.13525 (2022).

[BBSS22] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song, Learning single-index
models with shallow neural networks, Advances in Neural Information Processing Sys-
tems 35 (2022), 9768–9783.

[BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala, Kernels as features: On
kernels, margins, and low-dimensional mappings, Machine Learning 65 (2006), 79–94.

[BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern
machine-learning practice and the classical bias–variance trade-off, Proceedings of
the National Academy of Sciences 116 (2019), no. 32, 15849–15854.

[BL19] Yu Bai and Jason D Lee, Beyond linearization: On quadratic and higher-order ap-
proximation of wide neural networks, International Conference on Learning Represen-
tations, 2019.

[BLLT20] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler, Benign over-
fitting in linear regression, Proceedings of the National Academy of Sciences 117
(2020), no. 48, 30063–30070.

[BMM18] Mikhail Belkin, Siyuan Ma, and Soumik Mandal, To understand deep learning we
need to understand kernel learning, International Conference on Machine Learning,
PMLR, 2018, pp. 541–549.

[BMR21] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin, Deep learning: a sta-
tistical viewpoint, Acta numerica 30 (2021), 87–201.

[BMZ23] Raphaël Berthier, Andrea Montanari, and Kangjie Zhou, Learning time-scales in two-
layers neural networks, arXiv:2303.00055 (2023).

[BRT19] Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov, Does data interpola-
tion contradict statistical optimality?, The 22nd International Conference on Artificial
Intelligence and Statistics, PMLR, 2019, pp. 1611–1619.

[BTA11] Alain Berlinet and Christine Thomas-Agnan, Reproducing kernel hilbert spaces in
probability and statistics, Springer Science & Business Media, 2011.

[CB18] Lénaı̈c Chizat and Francis Bach, On the global convergence of gradient descent for
over-parameterized models using optimal transport, Advances in Neural Information
Processing Systems 31 (2018), 3036–3046.

[CB20] Lenaic Chizat and Francis Bach, Implicit bias of gradient descent for wide two-layer
neural networks trained with the logistic loss, Conference on Learning Theory, PMLR,
2020, pp. 1305–1338.

[CBL+ 20] Minshuo Chen, Yu Bai, Jason D Lee, Tuo Zhao, Huan Wang, Caiming Xiong, and
Richard Socher, Towards understanding hierarchical learning: Benefits of neural rep-
resentations, arXiv:2006.13436 (2020).

[CBP21] Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan, Spectral bias and task-
model alignment explain generalization in kernel regression and infinitely wide neural
networks, Nature communications 12 (2021), no. 1, 2914.

[CDV07] Andrea Caponnetto and Ernesto De Vito, Optimal rates for the regularized least-
squares algorithm, Foundations of Computational Mathematics 7 (2007), no. 3, 331–
368.

[CFW+ 21] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu, Towards
understanding the spectral bias of deep learning, Proceedings of the Thirtieth Inter-
national Joint Conference on Artificial Intelligence, IJCAI-21, 2021, pp. 2205–2211.

[CG19] Yuan Cao and Quanquan Gu, Generalization bounds of stochastic gradient descent for
wide and deep neural networks, Advances in Neural Information Processing Systems
32 (2019), 10836–10846.

[Chi11] Theodore S Chihara, An introduction to orthogonal polynomials, Courier Corporation,


2011.

[CLKZ21] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová, Generalization
error rates in kernel regression: The crossover from the noiseless to noisy regime,
Advances in Neural Information Processing Systems 34 (2021), 10131–10143.

[CLKZ22] , Error rates for kernel classification under source and capacity conditions,
arXiv:2201.12655 (2022).

[CM22] Chen Cheng and Andrea Montanari, Dimension free ridge regression,
arXiv:2210.08571 (2022).

[COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable
programming, NeurIPS 2019-33rd Conference on Neural Information Processing Sys-
tems, 2019, pp. 2937–2947.

[CS09] Youngmin Cho and Lawrence Saul, Kernel methods for deep learning, Advances in
neural information processing systems 22 (2009).

[DFS16] Amit Daniely, Roy Frostig, and Yoram Singer, Toward deeper understanding of neural
networks: The power of initialization and a dual view on expressivity, Advances in
neural information processing systems 29 (2016).

[DHM89] Ronald A DeVore, Ralph Howard, and Charles Micchelli, Optimal nonlinear approx-
imation, Manuscripta mathematica 63 (1989), 469–478.

[Dic16] Lee H Dicker, Ridge regression and asymptotic minimax estimation over spheres of growing dimension, Bernoulli 22 (2016), no. 1, 1–37.

[DL20] Xialiang Dou and Tengyuan Liang, Training neural networks as learning data-adaptive
kernels: Provable representation and approximation benefits, Journal of the American
Statistical Association (2020), 1–14.

[DLL+ 19] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai, Gradient descent
finds global minima of deep neural networks, International Conference on Machine
Learning, PMLR, 2019, pp. 1675–1685.

[DLS22] Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi, Neural networks can learn
representations with gradient descent, Conference on Learning Theory, PMLR, 2022,
pp. 5413–5452.

[DM20] Amit Daniely and Eran Malach, Learning parities with neural networks, Advances in
Neural Information Processing Systems 33 (2020).

[DMHR+ 18] AGG De Matthews, J Hron, M Rowland, RE Turner, and Z Ghahramani, Gaus-
sian process behaviour in wide deep neural networks, 6th International Conference on
Learning Representations, ICLR 2018-Conference Track Proceedings, 2018.

[DW18] Edgar Dobriban and Stefan Wager, High-dimensional asymptotics of prediction: Ridge
regression and classification, The Annals of Statistics 46 (2018), no. 1, 247–279.

[DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably
optimizes over-parameterized neural networks, International Conference on Learning
Representations, 2018.

[EF14] Costas Efthimiou and Christopher Frye, Spherical harmonics in p dimensions, World
Scientific, 2014.

[EK10] Noureddine El Karoui, The spectrum of kernel random matrices, The Annals of Statis-
tics 38 (2010), no. 1, 1–50.

[FDZ19] Cong Fang, Hanze Dong, and Tong Zhang, Over parameterized two-level neural net-
works can learn near optimal feature representations, arXiv:1910.11508 (2019).

[FW20] Zhou Fan and Zhichao Wang, Spectra of the conjugate kernel and neural tangent kernel
for linear-width neural networks, Advances in neural information processing systems
33 (2020), 7710–7721.

[GLK+ 20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zde-
borová, Generalisation error in learning with random features and the hidden manifold
model, International Conference on Machine Learning, PMLR, 2020, pp. 3452–3462.

[GLR+ 22] Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mézard,
and Lenka Zdeborová, The gaussian equivalence of generative models for learning
with shallow neural networks, Mathematical and Scientific Machine Learning, PMLR,
2022, pp. 426–471.

[GLSS18a] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro, Characterizing im-
plicit bias in terms of optimization geometry, International Conference on Machine
Learning, PMLR, 2018, pp. 1832–1841.

[GLSS18b] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro, Implicit bias of
gradient descent on linear convolutional networks, Advances in neural information
processing systems 31 (2018).

[GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Limita-
tions of lazy training of two-layers neural networks, Proceedings of the 33rd Interna-
tional Conference on Neural Information Processing Systems, 2019, pp. 9111–9121.

[GMMM20] , When do neural networks outperform kernel methods?, Advances in Neural


Information Processing Systems 33 (2020), 14820–14830.

[GMMM21] , Linearized two-layers neural networks in high dimension, The Annals of


Statistics 49 (2021), no. 2, 1029–1054.

[GSJW20] Mario Geiger, Stefano Spigler, Arthur Jacot, and Matthieu Wyart, Disentangling
feature and lazy training in deep neural networks, Journal of Statistical Mechanics:
Theory and Experiment 2020 (2020), no. 11, 113301.

[GWB+ 17] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur,
and Nati Srebro, Implicit regularization in matrix factorization, Advances in neural
information processing systems 30 (2017).

[HL22a] Hong Hu and Yue M Lu, Sharp asymptotics of kernel ridge regression beyond the
linear regime, arXiv:2205.06798 (2022).

[HL22b] , Universality laws for high-dimensional learning with random features, IEEE
Transactions on Information Theory (2022).

[HMRT22] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani, Surprises
in high-dimensional ridgeless least squares interpolation, The Annals of Statistics 50
(2022), no. 2, 949–986.

[HN19] Boris Hanin and Mihai Nica, Finite depth and width corrections to the neural tangent
kernel, arXiv:1909.05989 (2019).

[Hor91] Kurt Hornik, Approximation capabilities of multilayer feedforward networks, Neural
networks 4 (1991), no. 2, 251–257.

[HY20] Jiaoyang Huang and Horng-Tzer Yau, Dynamics of deep neural networks and neu-
ral tangent hierarchy, International Conference on Machine Learning, PMLR, 2020,
pp. 4542–4551.

[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Conver-
gence and generalization in neural networks, Advances in neural information process-
ing systems, 2018, pp. 8571–8580.

[JSS+ 20] Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clément Hongler, and Franck
Gabriel, Kernel alignment risk estimator: Risk prediction from training data, Ad-
vances in Neural Information Processing Systems 33 (2020), 15568–15578.

[JT18a] Ziwei Ji and Matus Telgarsky, Gradient descent aligns the layers of deep linear net-
works, arXiv:1810.02032 (2018).

[JT18b] , Risk and parameter convergence of logistic regression, arXiv:1803.07300


(2018).

[JT19] , Polylogarithmic width suffices for gradient descent to achieve arbitrarily small
test error with shallow relu networks, International Conference on Learning Represen-
tations, 2019.

[JT20] , Directional convergence and alignment in deep learning, Advances in Neural


Information Processing Systems 33 (2020), 17176–17186.

[Kac56] Mark Kac, Foundations of kinetic theory, Proceedings of The third Berkeley sympo-
sium on mathematical statistics and probability, vol. 3, 1956, pp. 171–197.

[KH19] Kenji Kawaguchi and Jiaoyang Huang, Gradient descent finds global minima for gen-
eralizable deep neural networks of practical sizes, 2019 57th Annual Allerton Confer-
ence on Communication, Control, and Computing (Allerton), IEEE, 2019, pp. 92–99.

[KY17] Antti Knowles and Jun Yin, Anisotropic local laws for random matrices, Probability
Theory and Related Fields 169 (2017), 257–352.

[KZSS21] Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro, Uniform
convergence of interpolators: Gaussian width, norm bounds and benign overfitting,
Advances in Neural Information Processing Systems 34 (2021), 20657–20668.

[LBN+ 18] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Penning-
ton, and Jascha Sohl-Dickstein, Deep neural networks as gaussian processes, Interna-
tional Conference on Learning Representations, 2018.

[LCM20] Zhenyu Liao, Romain Couillet, and Michael W Mahoney, A random matrix analysis
of random fourier features: beyond the gaussian kernel, a precise phase transition, and
the corresponding double descent, Advances in Neural Information Processing Systems
33 (2020), 13939–13950.

[LL18] Yuanzhi Li and Yingyu Liang, Learning overparameterized neural networks via
stochastic gradient descent on structured data, Advances in Neural Information Pro-
cessing Systems, 2018, pp. 8157–8166.

[LL19] Kaifeng Lyu and Jian Li, Gradient descent maximizes the margin of homogeneous
neural networks, arXiv:1906.05890 (2019).

[LLL20] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu, Towards resolving the implicit bias of
gradient descent for matrix factorization: Greedy low-rank learning, arXiv:2012.09839
(2020).

[LLS21] Fanghui Liu, Zhenyu Liao, and Johan Suykens, Kernel regression in high dimensions:
Refined analysis beyond double descent, International Conference on Artificial Intelli-
gence and Statistics, PMLR, 2021, pp. 649–657.

[LML+ 20] Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying, A mean field analysis
of deep resnet and beyond: Towards provably optimization via overparameterization
from depth, International Conference on Machine Learning, PMLR, 2020, pp. 6426–
6436.

[LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: Kernel “ridgeless” regres-
sion can generalize, The Annals of Statistics 48 (2020), no. 3, 1329–1347.

[LRZ20] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, On the multiple descent of
minimum-norm interpolants and restricted lower isometry of kernels, Conference on
Learning Theory, PMLR, 2020, pp. 2683–2711.

[LS75] BF Logan and Larry A Shepp, Optimal reconstruction of a function from its projec-
tions, Duke Math. J. 42 (1975), no. 1, 645–659.

[LWY+ 19] Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov,
and Sanjeev Arora, Enhanced convolutional neural tangent kernels, arXiv:1911.00809
(2019).

[LXS+ 19] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha
Sohl-Dickstein, and Jeffrey Pennington, Wide neural networks of any depth evolve
as linear models under gradient descent, Advances in neural information processing
systems 32 (2019), 8572–8583.

[LZB20] Chaoyue Liu, Libin Zhu, and Mikhail Belkin, On the linearity of large non-linear
models: when and why the tangent kernel is constant, Advances in Neural Information
Processing Systems 33 (2020).

[MBM18] Song Mei, Yu Bai, and Andrea Montanari, The landscape of empirical risk for non-
convex losses, The Annals of Statistics 46 (2018), no. 6A, 2747–2774.

[MHPG+ 22] Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Mu-
rat A Erdogdu, Neural networks efficiently learn low-dimensional representations with
sgd, arXiv:2209.14863 (2022).

[Mis22] Theodor Misiakiewicz, Spectrum of inner-product kernel matrices in the polynomial
regime and multiple descent phenomenon in kernel ridge regression, arXiv:2204.10425
(2022).

[MJ66] Henry P McKean Jr, A class of markov processes associated with nonlinear parabolic
equations, Proceedings of the National Academy of Sciences 56 (1966), no. 6, 1907–
1911.

[MM19] Song Mei and Andrea Montanari, The generalization error of random features regres-
sion: Precise asymptotics and the double descent curve, Communications on Pure and
Applied Mathematics (2019).

[MM21] Theodor Misiakiewicz and Song Mei, Learning with convolution and pooling operations
in kernel methods, arXiv:2111.08308 (2021).

[MMM19] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Mean-field theory of two-
layers neural networks: dimension-free bounds and kernel limit, Conference on Learn-
ing Theory, PMLR, 2019, pp. 2388–2464.

[MMM21] , Learning with invariances in random features and kernel models, Conference
on Learning Theory, PMLR, 2021, pp. 3351–3418.

[MMM22] , Generalization error of random feature and kernel methods: hypercontractiv-


ity and kernel matrix concentration, Applied and Computational Harmonic Analysis
59 (2022), 3–84.

[MMN18] Song Mei, Andrea Montanari, and Phan-Minh Nguyen, A mean field view of the land-
scape of two-layer neural networks, Proceedings of the National Academy of Sciences
115 (2018), no. 33, E7665–E7671.

[MRSY19] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan, The generalization
error of max-margin linear classifiers: High-dimensional asymptotics in the over-
parametrized regime, arXiv:1911.01544 (2019).

[MS22] Andrea Montanari and Basil N Saeed, Universality of empirical risk minimization,
Conference on Learning Theory, PMLR, 2022, pp. 4310–4312.

[MVSS20] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai,
Harmless interpolation of noisy data in regression, IEEE Journal on Selected Areas
in Information Theory 1 (2020), no. 1, 67–83.

[MZ22] Andrea Montanari and Yiqiao Zhong, The interpolation phase transition in neural net-
works: Memorization and generalization under lazy training, The Annals of Statistics
50 (2022), no. 5, 2816–2847.

[NBMS17] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro, Ex-
ploring generalization in deep learning, Advances in neural information processing
systems 30 (2017).

[NCS19] Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki, Gradient descent can
learn less over-parameterized two-layer neural networks on classification problems,
arXiv:1905.09870 (2019).

[Nea95] Radford M Neal, Bayesian learning for neural networks, Ph.D. thesis, Citeseer, 1995.

[NP20] Phan-Minh Nguyen and Huy Tuan Pham, A rigorous framework for the mean field
limit of multilayer neural networks, arXiv:2001.11443 (2020).

[NTS14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro, In search of the real in-
ductive bias: On the role of implicit regularization in deep learning, arXiv:1412.6614
(2014).

[NXB+ 18] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron,
Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein, Bayesian deep con-
volutional networks with many channels are gaussian processes, International Confer-
ence on Learning Representations, 2018.

[OS19] Samet Oymak and Mahdi Soltanolkotabi, Overparameterized nonlinear learning: Gra-
dient descent takes the shortest path?, International Conference on Machine Learning,
PMLR, 2019, pp. 4951–4960.

[OS20] , Toward moderate overparameterization: Global convergence guarantees for


training shallow neural networks, IEEE Journal on Selected Areas in Information
Theory 1 (2020), no. 1, 84–105.

[Pin99] Allan Pinkus, Approximation theory of the mlp model in neural networks, Acta nu-
merica 8 (1999), 143–195.

[RGKZ21] Maria Refinetti, Sebastian Goldt, Florent Krzakala, and Lenka Zdeborová, Classifying
high-dimensional gaussian mixtures: Where kernel methods fail and neural networks
succeed, arXiv:2102.11742 (2021).

[RMR21] Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco, Asymptotics of ridge
(less) regression under general source condition, International Conference on Artificial
Intelligence and Statistics, PMLR, 2021, pp. 3889–3897.

[RR07] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines,
Advances in neural information processing systems 20 (2007).

[RR08] , Weighted sums of random kitchen sinks: Replacing minimization with ran-
domization in learning, Advances in neural information processing systems 21 (2008).

[RR17] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with ran-
dom features., NIPS, 2017, pp. 3215–3225.

[RVE18] Grant M Rotskoff and Eric Vanden-Eijnden, Neural networks as interacting parti-
cle systems: Asymptotic convexity of the loss landscape and universal scaling of the
approximation error, stat 1050 (2018), 22.

[RZ19] Alexander Rakhlin and Xiyu Zhai, Consistency of interpolation with laplace kernels
is a high-dimensional phenomenon, Conference on Learning Theory, PMLR, 2019,
pp. 2595–2623.
[SFG+ 20] Vaishaal Shankar, Alex Fang, Wenshuo Guo, Sara Fridovich-Keil, Jonathan Ragan-
Kelley, Ludwig Schmidt, and Benjamin Recht, Neural kernels without tangents, In-
ternational Conference on Machine Learning, PMLR, 2020, pp. 8614–8623.
[Sha22] Ohad Shamir, The implicit bias of benign overfitting, Conference on Learning Theory,
PMLR, 2022, pp. 448–478.
[SHN+ 18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Sre-
bro, The implicit bias of gradient descent on separable data, The Journal of Machine
Learning Research 19 (2018), no. 1, 2822–2878.
[SS20] Justin Sirignano and Konstantinos Spiliopoulos, Mean field analysis of neural net-
works: A central limit theorem, Stochastic Processes and their Applications 130
(2020), no. 3, 1820–1852.
[SY19] Zhao Song and Xin Yang, Quadratic suffices for over-parametrization via matrix cher-
noff bound, arXiv:1906.03593 (2019).
[Sze39] Gábor Szegő, Orthogonal polynomials, vol. 23, American Mathematical Soc., 1939.
[TB20] Alexander Tsigler and Peter L Bartlett, Benign overfitting in ridge regression,
arXiv:2009.14286 (2020).
[Ver18] Roman Vershynin, High-dimensional probability: An introduction with applications in
data science, vol. 47, Cambridge university press, 2018.
[VW19] Santosh Vempala and John Wilmes, Gradient descent for one-hidden-layer neural net-
works: Polynomial convergence and sq lower bounds, Conference on Learning Theory,
PMLR, 2019, pp. 3115–3117.
[Wai19] Martin J Wainwright, High-dimensional statistics: A non-asymptotic viewpoint,
vol. 48, Cambridge university press, 2019.
[WGL+ 19] Blake Woodworth, Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro,
Kernel and deep regimes in overparametrized models, Conference on Learning Theory
(COLT), 2019.
[WGL+ 20] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro
Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro, Kernel and rich regimes
in overparametrized models, Conference on Learning Theory, PMLR, 2020, pp. 3635–
3673.
[Wil96] Christopher Williams, Computing with infinite networks, Advances in neural informa-
tion processing systems 9 (1996).
[WX20] Denny Wu and Ji Xu, On the optimal weighted ℓ2 regularization in overparameter-
ized linear regression, Advances in Neural Information Processing Systems 33 (2020),
10112–10123.

[XHM+ 22] Lechao Xiao, Hong Hu, Theodor Misiakiewicz, Yue Lu, and Jeffrey Pennington, Pre-
cise learning curves and higher-order scalings for dot-product kernel regression, Ad-
vances in Neural Information Processing Systems 35 (2022), 4558–4570.

[Xia22] Lechao Xiao, Eigenspace restructuring: a principle of space and frequency in neural
networks, Conference on Learning Theory, PMLR, 2022, pp. 4888–4944.

[YH20] Greg Yang and Edward J Hu, Feature learning in infinite-width neural networks,
arXiv:2011.14522 (2020).

[YKM20] Chulhee Yun, Shankar Krishnan, and Hossein Mobahi, A unifying view on implicit
bias in training linear neural networks, arXiv:2010.02501 (2020).

[YS19] Gilad Yehudai and Ohad Shamir, On the power and limitations of random features for
understanding neural networks, Advances in Neural Information Processing Systems
32 (2019), 6598–6608.

[ZBH+ 21] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, Un-
derstanding deep learning (still) requires rethinking generalization, Communications
of the ACM 64 (2021), no. 3, 107–115.

[ZCZG20] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu, Gradient descent optimizes
over-parameterized deep relu networks, Machine Learning 109 (2020), no. 3, 467–492.

[ZG19] Difan Zou and Quanquan Gu, An improved analysis of training over-parameterized
deep neural networks, arXiv:1906.04688 (2019).
