LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Prasanna Mayilvahanan* 1,2,3,4, Thaddäus Wiedemer* 1,2,3,4, Sayak Mallick 1,4, Matthias Bethge 2,3,4, Wieland Brendel 1,2,3

*Equal contribution. 1 Max Planck Institute for Intelligent Systems, 2 ELLIS Institute Tübingen, 3 Tübingen AI Center, 4 University of Tübingen. Correspondence to: Prasanna Mayilvahanan <[Link]@[Link]>, Thaddäus Wiedemer <[Link]@[Link]>. Code is available on our website.

arXiv:2502.12120v1, 17 Feb 2025

Abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Figure 1. LLMs' loss-to-loss scaling follows power laws primarily shaped by the choice of pretraining data and tokenizer. Using Llama trained on FineWeb-Edu as a baseline, we intervene on various factors to assess their impact on train-to-test loss scaling. Changing the pretraining data has the largest effect, followed by the choice of tokenizer. Switching the architecture, e.g., from Llama to Mamba, has limited impact, while factors like model size, context length, and optimizer settings exert little-to-no influence.

1. Introduction

Scaling laws have long guided Large Language Model (LLM) pretraining, determining model and data size under a fixed compute budget (Kaplan et al., 2020; Hoffmann et al., 2022; Grattafiori et al., 2024). Typically, scaling laws relate model performance, usually measured as training or validation loss, to total compute measured in floating point operations (FLOPs). FLOPs account for both parameter count and the number of training tokens. While useful for pretraining, scaling laws do not capture how well a model ultimately performs on downstream tasks (Gadre et al., 2024; Schaeffer et al., 2024; Du et al., 2025). Consequently, multiple works have begun to investigate downstream scaling laws: scaling laws that directly predict downstream loss from FLOPs (Schaeffer et al., 2024; Gadre et al., 2024).

Brandfonbrener et al. (2024) show that downstream scaling laws can be decomposed into compute-to-train-loss scaling laws and (train)-loss-to-(test)-loss scaling laws. The combination of compute-to-loss and loss-to-loss scaling laws enables efficient and accurate prediction of a model's downstream performance. Moreover, holistic downstream scaling laws often optimize for a single task or average performance across tasks (Gadre et al., 2024; Schaeffer et al., 2024), whereas loss-to-loss (especially test-to-test) scaling laws can help tune a model's performance across a broader range of downstream tasks, e.g., to ensure broad or robust generalization.
While the impact of design choices like pretraining distribution, architecture, tokenizer, optimizer settings, etc. on compute-to-loss scaling laws is fairly well understood (Kaplan et al., 2020; Hoffmann et al., 2022; Tay et al., 2022; Wang et al., 2024; Porian et al., 2025; Du et al., 2025), a similar understanding is missing for loss-to-loss scaling laws. To close this gap, we extend the work of Brandfonbrener et al. (2024); Du et al. (2025), which analyze loss-to-loss relationships within a single architectural and training setup. Adding to that, our study systematically explores how multiple factors influence scaling laws across a diverse range of architectures and training configurations.

Our analysis additionally draws inspiration from a body of work in robustness evaluation of vision (and later language) models (Taori et al., 2020; Miller et al., 2021; Fang et al., 2022; Awadalla et al., 2022). These works show that model performance on different distributions is frequently strongly correlated, and most model and training settings have little-to-no impact on the task-to-task scaling trend of model performance. We treat loss-to-loss curves similarly and perform a series of interventions using over 6000 model checkpoints to understand what design choices causally affect scaling trends.

We make three main observations, illustrated in Fig. 1:

1. LLMs' loss-to-loss scaling consistently follows shifted power laws.
2. Pretraining data and tokenizer are the most salient factors for these scaling laws.
3. In contrast, architecture plays a minor role, while model size, context length, and optimizer settings have negligible impact on loss-to-loss scaling.

Further, we put our observations in the context of downstream scaling laws and discuss the relationship between loss-to-loss and compute-to-loss scaling laws. Our results indicate that different LLM architectures might encode very similar inductive biases, freeing practitioners to optimize architectures for training efficiency without adversely affecting downstream scaling laws.

2. From Scaling Laws to Interventions

Compute-to-Train Scaling Laws: Scaling laws aim to optimize model size and token allocation within a fixed compute budget (expressed in FLOPs) by modeling the relationship between parameters, training tokens, and training loss (Hestness et al., 2017; Kaplan et al., 2020; Hoffmann et al., 2022). However, these laws are inherently shaped by the data distribution, architecture, and optimization settings (Tay et al., 2022; Wang et al., 2024; Brandfonbrener et al., 2024; Porian et al., 2025), making their application across setups non-trivial.

Compute-to-Downstream Scaling Laws: Recent works extend scaling laws to directly predict downstream task performance from compute (Gadre et al., 2024; Isik et al., 2024; Du et al., 2025). While some initial works attempt to map compute budgets to accuracy on individual tasks, multiple tasks, or aggregate benchmarks, this mapping is usually noisy due to several transformations in the accuracy computation that degrade the statistical relationship (Schaeffer et al., 2024). More recent efforts instead use the model's average loss on the correct answers of the task as a proxy (Madaan et al., 2024; Brandfonbrener et al., 2024). Such compute-to-downstream scaling laws provide a more practical perspective on scaling but are still specific to a given training setup.

Loss-to-Loss Scaling Laws: Loss-to-loss scaling laws aim to improve the transferability of scaling insights between training setups by examining the relationship between training (or validation) and test losses, between different validation losses, or between different test losses (Brandfonbrener et al., 2024). This perspective is crucial for several reasons. First, train-to-train (or validation-to-validation) scaling implies how scaling laws transfer across datasets (Brandfonbrener et al., 2024). Second, incorporating train-to-test (or validation-to-test) scaling laws alongside compute-to-train scaling laws provides more precise insight into how compute budgets translate to downstream performance and can help study emergent abilities of models (Du et al., 2025). Third, while compute-to-loss scaling laws often target a single downstream task or average task performance, train-to-test and test-to-test scaling laws can help tune a model's performance across diverse tasks, e.g., to foster the development of generalist LLMs with a balanced task performance.
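To make this decomposition concrete, the following sketch composes a compute-to-train-loss law with a train-to-test loss-to-loss law of the shifted-power-law form fitted later in §3 (Eq. 1). All coefficients below are invented for illustration; they are not values fitted in this paper.

```python
import numpy as np

# Hypothetical, made-up coefficients -- not fitted values from the paper.
E_TRAIN, A, ALPHA = 1.70, 42.0, 0.15   # compute-to-train-loss law
E_TEST, K, KAPPA = 0.90, 1.10, 0.80    # train-to-test loss-to-loss law (Eq. 1)

def train_loss(flops):
    """Compute-to-train-loss scaling law: L_train(C) = E_train + A * C^(-alpha)."""
    return E_TRAIN + A * flops ** (-ALPHA)

def test_loss(l_train):
    """Train-to-test loss-to-loss law: L_test = K * (L_train - E_train)^kappa + E_test."""
    return K * (l_train - E_TRAIN) ** KAPPA + E_TEST

# Composing the two laws predicts downstream (test) loss directly from a FLOP budget.
for flops in (1e18, 1e19, 1e20):
    lt = train_loss(flops)
    print(f"{flops:.0e} FLOPs -> train loss {lt:.3f}, predicted test loss {test_loss(lt):.3f}")
```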
Accuracy on the Line: Our work is inspired by robustness research in image classification. Prior studies (Taori et al., 2020; Miller et al., 2021; Fang et al., 2022) demonstrate a strong and consistent correlation between in-distribution and out-of-distribution (OOD) accuracy across various image classification models and settings. We are not the first to observe the similarity to LLMs, where recent works (Gadre et al., 2024; Brandfonbrener et al., 2024; Du et al., 2025) highlight strong scaling trends (linear or power-law-like) between losses. However, these studies are typically constrained to a single architecture or training setup. In contrast, we examine trends across a wide range of architectures and training conditions (see §4), showing for the first time that loss-to-loss scaling follows consistent laws across settings.

Robustness Interventions: Accuracy-to-accuracy relationships in the vision, vision-language, and language domains have also been used to study how scaling laws shift under robustness interventions like dataset size, adversarial training, architectural details, loss functions, supervision type, or OOD shifts (Taori et al., 2020; Fang et al., 2022; Awadalla et al., 2022; Mayilvahanan et al., 2024a;b; Wiedemer et al., 2024). For vision-language models, Taori et al. (2020); Fang et al. (2022) find that most interventions do not impact OOD performance; only increasing data diversity has a significant positive impact. Their findings suggest that curating better datasets is crucial for training vision and vision-language models that generalize broadly.

Motivated by these insights, we aim to uncover the factors determining loss-to-loss scaling to guide practitioners in developing models for specific downstream performance. Our insights complement the findings from Awadalla et al. (2022), who show that accuracy-accuracy scaling trends in comprehension tasks are agnostic to architecture type (e.g., encoder-only, encoder-decoder, decoder-only) after fine-tuning. In contrast to their study, we focus on zero-shot generalization across a diverse set of tasks, specifically investigating state-of-the-art decoder-only architectures such as GPT (Radford et al., 2019), Llama (Grattafiori et al., 2024), and Mamba (Gu & Dao, 2024; Dao & Gu, 2024).
Figure 2. Loss-to-loss scaling consistently obeys power laws. We extend results from Brandfonbrener et al. (2024) by many architectures, training settings, and validation/test sets. We show illustrative shifted power laws for Mamba trained on FineWeb-Edu here; more configurations and test sets can be found in App. B. For clarity, scatter plots display a random sample of all data points; all points are used to fit the scaling laws.

Figure 3. Schematic of our causal analysis. Checkpoints of a base model trained on different numbers of tokens and with different seeds lie on the same loss-to-loss line. Better-performing models (typically with higher compute) achieve lower loss (towards the bottom left). We intervene on training settings (e.g., pretraining data, architecture, etc.) and retrain from scratch, yielding new models that again constitute loss-to-loss lines. An effective intervention produces models on a new line; an ineffective intervention yields models that lie on the line of the base model.

3. Fitting Loss-to-Loss Scaling Laws

We focus our analysis on train-to-train and train-to-test scaling. Combined with known compute-to-train scaling laws, these loss-to-loss scaling laws paint a complete picture of a model's downstream performance given a compute budget and characterize a model's downstream performance distribution across tasks (Brandfonbrener et al., 2024).

As is standard in the recent literature, we report test loss as a proxy for downstream performance. Following Brandfonbrener et al. (2024); Madaan et al. (2024); Schaeffer et al. (2024), we track the test loss as a model's loss on only the correct answer given the question as context. This is sometimes called the cloze formulation of a task since the model is essentially evaluated on its ability to fill in blanks.
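The snippet below is a minimal sketch of this cloze-style test loss for a Hugging Face causal language model; the model name and prompt format are placeholders, and the paper's actual evaluations use the LM Evaluation Harness (see §4).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def cloze_loss(question: str, answer: str) -> float:
    """Mean cross-entropy on the correct answer tokens only, with the question as context."""
    ctx_ids = tok(question, return_tensors="pt").input_ids
    ans_ids = tok(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # mask context tokens so they do not contribute to the loss
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()

# Lower is better; note that how a tokenizer inserts BOS/EOS tokens here is exactly
# the kind of detail that §4 shows can shift loss-to-loss scaling.
print(cloze_loss("Question: What do plants absorb from sunlight? Answer:", "energy"))
```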
positive impact. Their findings suggest that curating better Brandfonbrener et al. (2024) predict train-to-train and train-
datasets is crucial for training vision and vision-language to-test scaling laws to follow a shifted power law1
models that generalize broadly.  κ
Ly fpN,D ≈ K · Lx fpN,D − Ex|p + Ey|p , (1)
 
Motivated by these insights, we aim to uncover the factors
determining loss-to-loss scaling to guide practitioners in
developing models for specific downstream performance. where Lx , Ly are the losses on datasets Dx , Dy shown on
Our insights complement the findings from Awadalla et al. the x- and y-axis. fpN,D is a model trained with N parame-
(2022), who show that accuracy-accuracy scaling trends ters on D tokens on the pretraining set Dp , and K and κ are
in comprehension tasks are agnostic to architecture type parameters to be fit. Ex|p , Ey|p are the irreducible errors
(e.g., encoder-only, encoder-decoder, decoder-only) after (i.e., minimum loss) that fp trained on Dp can achieve on
fine-tuning. In contrast to their study, we focus on zero-shot the datasets Dx , Dy .
generalization across a diverse set of tasks, specifically in- Given a model configuration, we obtain 200 to 800 by
vestigating state-of-the-art decoder-only architectures such recording checkpoints throughout training, across model
as GPT (Radford et al., 2019), Llama (Grattafiori et al., sizes, and for various seeds. All points are used to first es-
2024), and Mamba (Gu & Dao, 2024; Dao & Gu, 2024). timate Ex|p , Ey|p from individual compute-to-loss scaling
laws and then fit K, κ.
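A minimal sketch of this two-stage fit is shown below, assuming the irreducible errors have already been estimated from compute-to-loss fits; the per-checkpoint losses are synthetic stand-ins rather than the paper's data.

```python
import numpy as np
from scipy.optimize import curve_fit

E_x, E_y = 1.7, 0.9                      # irreducible errors, pre-estimated per dataset
rng = np.random.default_rng(0)
loss_x = np.linspace(1.8, 3.5, 400)      # stand-in for checkpoint losses on D_x
loss_y = E_y + 1.1 * (loss_x - E_x) ** 0.8 + rng.normal(0, 0.01, loss_x.size)

def shifted_power_law(lx, K, kappa):
    """Eq. (1) with the irreducible errors held fixed."""
    return K * (lx - E_x) ** kappa + E_y

(K_hat, kappa_hat), _ = curve_fit(shifted_power_law, loss_x, loss_y, p0=(1.0, 1.0))
print(K_hat, kappa_hat)  # should roughly recover the generating values 1.1 and 0.8
```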
Note that to study the impact of various training settings, we show losses of the same model on different datasets on the x- and y-axis. This should not be confused with some curves in Brandfonbrener et al. (2024) that showed the losses of different compute-matched models on the x- and y-axis.

Figure 4. Pretraining data has a substantial impact on loss-to-loss scaling laws. Models are matched on architecture and tokenizer.

With this setup, we can now analyze the loss-to-loss scaling laws of models trained with different configurations. Brandfonbrener et al. (2024) only showed loss-to-loss scaling for a single architecture with a fixed training recipe: Olmo (Groeneveld et al., 2024). We extend their analysis by multiple architectures, pretraining sets, tokenizers, and training settings, all listed in §4. As an illustrative example, we show loss-to-loss scaling for Mamba trained on FineWeb-Edu in Fig. 2; more examples with additional test sets are listed in App. B and throughout §4.

Overall, across models, datasets, tokenizers, and optimization hyperparameters, shifted power laws describe loss-to-loss scaling well (Eq. (1)). On some datasets, the performance of models with high loss (towards the top right of each curve) is not captured perfectly by the power law formulation proposed by Brandfonbrener et al. (2024). This is unsurprising, given that these data points typically represent models in early training stages, but might hint at a refined formulation of Eq. (1) for the high-loss regime.

Note also that loss-to-loss scaling follows a power law even for datasets on which the model never reaches high accuracy. E.g., Mamba in Fig. 2 never surpasses chance performance on ARC-Challenge and OpenBookQA, yet Eq. (1) describes the test loss equally well. This underlines the usefulness of loss-to-loss scaling laws to study model behavior.

Takeaway 1: Across architectures and training settings, loss-to-loss scaling generally follows a shifted power law as described in Eq. (1) and illustrated in Fig. 2.

4. A Causal Analysis of Loss-to-Loss Scaling

We now perform interventions on the model and training configurations to find what factors cause the exact shape of loss-to-loss scaling laws.

Our basic procedure is outlined in Fig. 3. As mentioned in §2, our approach is motivated by similar studies in the robustness literature. In contrast to that setting, we here lack paired in-distribution and out-of-distribution datasets. Instead, we simply consider all combinations of validation and test sets. For ease of visualization when intervening on the pretraining data, we always show FineWeb-Edu validation loss on the x-axis, even for models trained on different pretraining distributions. This choice is arbitrary and does not affect our results; see App. C. Similarly, we here report results for scaling laws of average validation and test loss; results for individual losses can be found in App. D.

For our analysis, we consider the impact of pretraining data, tokenizer, architecture, model size, context length, and optimizer settings.
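One way to make the "effective vs. ineffective intervention" criterion of Fig. 3 concrete is to test whether checkpoints from the intervened run fall off the baseline curve by more than the baseline scatter. The sketch below is an illustration under that assumption, with hypothetical inputs and threshold; it is not the paper's analysis code, which judges this via the fitted curves.

```python
import numpy as np

def residuals(loss_x, loss_y, K, kappa, E_x, E_y):
    """Vertical distance of (loss_x, loss_y) points from a fitted Eq. (1) curve."""
    return loss_y - (K * (loss_x - E_x) ** kappa + E_y)

def intervention_is_effective(base_x, base_y, new_x, new_y, baseline_fit, n_sigmas=3.0):
    """Flag an intervention as effective if its checkpoints sit off the baseline line."""
    K, kappa, E_x, E_y = baseline_fit            # parameters fitted on the baseline model
    base_scatter = residuals(base_x, base_y, K, kappa, E_x, E_y).std()
    new_offset = np.abs(residuals(new_x, new_y, K, kappa, E_x, E_y)).mean()
    return new_offset > n_sigmas * base_scatter
```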

Pretraining Sets: Our models are trained on FineWeb-Edu (Penedo et al., 2024), C4 (Dodge et al., 2021), and an uncopyrighted version of The Pile dubbed The Pile UC. Some models from Hugging Face are trained on the original version of The Pile (Gao et al., 2020) and The Pile Deduped (Biderman et al., 2023), a deduplicated version.

Validation Sets: Models are evaluated on 5000 sequences sampled from the validation sets of FineWeb-Edu, C4, The Pile, RefinedWeb (Penedo et al., 2023), and SlimPajama (Shen et al., 2024).

Test Sets: We use the LM Evaluation Harness framework (Gao et al., 2024) to assess model performance on HellaSwag (Zellers et al., 2019), COPA (Gordon et al., 2011), WinoGrande (Sakaguchi et al., 2019), PIQA (Bisk et al., 2019), OpenBookQA (Mihaylov et al., 2018), SocialIQA (Sap et al., 2019), CommonsenseQA (Talmor et al., 2019), MMLU (Hendrycks et al., 2021), as well as ARC-Easy and ARC-Challenge (Clark et al., 2018).

Architectures: We train Llama-3 (Grattafiori et al., 2024) with 417 M parameters and Mamba (Gu & Dao, 2024) with 420 M parameters using the Lingua framework (Videau et al., 2024), following Chinchilla scaling laws (Hoffmann et al., 2022). We supplement our analysis with pretrained GPT (Black et al., 2021; 2022; Biderman et al., 2023), Llama (Penedo et al., 2024), and Mamba (Gu & Dao, 2024; Dao & Gu, 2024) variants from Hugging Face (Wolf et al., 2020).

Tokenizers: We train Llama and Mamba with either a tiktoken tokenizer (128 k vocabulary size) or the gpt2 tokenizer (50 257 vocabulary size). Pretrained models from Hugging Face use an almost identical GPT-2 tokenizer, dubbed gpt2-HF. This version does not explicitly pad text with beginning- and end-of-sequence tokens. A few Hugging Face GPT models instead use the gpt-neox tokenizer with a slightly different vocab size of 50 254, which results in a different internal mapping compared to gpt2.

4.1. Pretraining Data, Tokenizer, and Architecture

First, we jointly examine the effect of pretraining data, architecture, and tokenizer. Since we face limited compute to train models from scratch, we do not have checkpoints for all possible combinations of these factors. Instead, we analyze the effect of an intervention on each factor when matching models in the two other factors. Note that we do not have sufficient checkpoints for some Hugging Face models to fit a power law. Nevertheless, in all these cases, the available data points follow a clearly discernible trend.

Effect of Pretraining Data: Fig. 4 illustrates the substantial impact pretraining data has on loss-to-loss scaling. Across architectures and compute (in different columns), changing the pretraining data leads to a large shift in the loss-to-loss curve. The only exception is the last column, where we compare Hugging Face models trained on The Pile and a deduplicated version. Models trained on either version lie on the same curve, suggesting that the deduplication procedure successfully reduced the dataset size while producing a similar distribution that does not significantly impact loss-to-loss scaling.

Takeaway 2: With fixed architecture and tokenizer, changing the pretraining data leads to substantial shifts in loss-to-loss scaling laws; see Fig. 4.

Figure 5. The tokenizer has a moderate impact on loss-to-loss scaling laws. Models are matched on pretraining data and architecture.

Effect of Tokenizer: Fig. 5 shows that the tokenizer, too, affects loss-to-loss scaling laws, albeit less strongly than pretraining data. It is interesting to see how slight deviations in the tokenizer can have a pronounced effect, particularly for train-to-train scaling laws. While the slight vocabulary size difference between gpt2-HF and gpt-neox has little impact on loss-to-loss scaling (last column), the different handling of special tokens in gpt2 and gpt2-HF does. To the best of our knowledge, this effect has not been observed before and could be explored in future work.

Figure 6. Architecture has limited impact on loss-to-loss scaling laws. Models are matched on pretraining data and tokenizer.

Takeaway 3: With fixed architecture and pretraining data, changing the tokenizer leads to moderate changes in loss-to-loss scaling laws; see Fig. 5.

Effect of Architecture: Lastly, Fig. 6 illustrates that changing the architecture results in only very slight changes in the loss-to-loss curves across pretraining data and tokenizer settings. Unlike pretraining data and tokenizer, architecture has little influence on train-to-train and train-to-test scaling. This is particularly surprising given the significant architectural differences between Llama or GPT (transformer-based models) and Mamba (a state-space model). These results raise an important question: Do current architectures encode distinct inductive biases or converge to similar solutions given the same training data? Further research is needed to understand the implications of this finding.

Takeaway 4: With fixed pretraining data and tokenizer, changing the architecture has limited impact on loss-to-loss scaling laws, raising questions about the distinctiveness of their inductive biases; see Fig. 6.

4.2. Model Size, Context Length, and Optimization

We now examine the effect of other common design decisions, such as the number or width of layers, the context length, optimizer, learning schedule, learning rate, and weight decay. In contrast to §4.1, we can perform these interventions separately since we can compare among our own Llama and Mamba models, whose training settings are matched by default.

To provide a more succinct overview, we only show train-to-train scaling laws in this section; additional train-to-test scaling laws for the same interventions can be found in App. E. We also do not show fitted power laws here since we display many more models per plot than in §4.1, and the scaling trends are clearly discernible.

Figure 7. Model size does not affect loss-to-loss scaling. The distinct lines correspond to different pretraining distributions (see Fig. 4), reinforcing that their influence is consistent across scales.

Effect of Model Size: We first examine the influence of model size by training Llama and Mamba models with varying depths and widths (see App. A for details). Fig. 7 shows the results: Despite significant differences in parameter count, the loss-to-loss scaling trends remain unchanged.
These findings align well with Du et al. (2025), who observed that model size has little effect on loss-to-loss scaling for GPT models. We extend this conclusion to Llama and Mamba and across multiple pretraining distributions.

Figure 8. Context length does not affect loss-to-loss scaling. Again, distinct lines correspond to different pretraining distributions (compare Fig. 4), validating their consistent impact.

Effect of Context Length: We next investigate the effect of varying the context length between 1024, 2048, and 3076 tokens. As shown in Fig. 8, this change does not meaningfully affect the loss-to-loss scaling curves.

Figure 9. Optimization settings do not affect loss-to-loss scaling.

Effect of Optimization Settings: Finally, we evaluate a range of common optimization settings: We consider the Adam (Kingma & Ba, 2017) and AdamW (Loshchilov & Hutter, 2019) optimizers, cosine (Loshchilov & Hutter, 2017) and WSD (Hu et al., 2024) schedules, learning rates of 3e−4 and 3e−3, and a weight decay of 0.1 or 3.3e−2. In our training setup, models using the Adam optimizer generally did not converge, and we exclude them from the analysis. Variations of the other settings do not affect loss-to-loss scaling coefficients, as shown in Fig. 9.
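For reference, the evaluated optimization grid can be written out as a small sweep config; this is a sketch of the combinations listed above, not the Lingua training configuration used for the actual runs.

```python
from itertools import product

sweep = {
    "optimizer": ["adam", "adamw"],
    "schedule": ["cosine", "wsd"],
    "learning_rate": [3e-4, 3e-3],
    "weight_decay": [0.1, 3.3e-2],
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs))  # 16 combinations; the Adam runs did not converge and were excluded
```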
Takeaway 5: Model size (Fig. 7), context length (Fig. 8), and optimization settings (Fig. 9) have negligible impact on loss-to-loss scaling laws.

Given the limited impact of the factors studied in this section, the conclusions from §4.1 should generalize well across variations in model size, context length, and optimization settings. For example, the substantial impact of the pretraining distribution can also be observed in Figs. 7 and 8.

5. Discussion and Future Work

Our findings add to the understanding of loss-to-loss scaling laws and reinforce prior results from vision and vision-language research (Taori et al., 2020; Fang et al., 2022) on the importance of choosing the pretraining data.

Implications for Optimizing Downstream Performance: Our results emphasize that the data distribution is key to achieving desirable loss-to-loss scaling and, in turn, strong downstream performance. Conversely, since architecture has little impact on the train-to-test conversion, it can be freely optimized for better compute scaling without affecting downstream scaling or performance.

Implications for Balancing Performance: If the aim is not only optimal average downstream performance but also a specific weighting between different tasks, e.g., to ensure a balanced downstream performance, individual train-to-test scaling laws can be used to tune a model's performance. Here, too, the pretraining data has the largest impact, and practitioners should thus consider the final application of their model already during the data curation stage. Ultimately, our findings underscore that pretraining data curation, rather than architectural innovation, can be the primary driver in developing robust, generalist models.

On Architectural Biases: The limited impact of even drastically different architectures on loss-to-loss scaling behavior illustrated in §4.1 and Fig. 6 suggests that architectures trained on the same data may implicitly learn highly similar representations. This might seem intuitive, as all models minimize the same loss function. One might expect them to converge toward comparable solutions when the training loss approaches zero (Roeder et al., 2020). However, even checkpoints of our smaller models, when trained on fewer tokens, follow the same scaling across architectures. Understanding whether this implies representational and behavioral similarity remains an intriguing open question. Beyond this, it remains to be seen whether it is possible to formulate architectures that fit the data well but exhibit different scaling trends.
On New Training Paradigms: Our study intentionally focuses on models trained with standard loss functions and conventional training settings to guide practitioners. The limited impact of existing paradigms does not preclude innovative training approaches from improving loss-to-loss scaling. In fact, a recent work by Saunshi et al. (2024) demonstrates that gradually increasing model depth and initializing based on layers from a smaller model produces markedly different scaling behavior, particularly in how perplexity translates to downstream accuracy. Similar structured growth approaches could offer new pathways for improving scaling efficiency and generalization for decoder-only LLMs trained with next-token prediction. We leave this exercise for future work.

On the Exhaustiveness of Interventions in §4.1: Our study clearly distinguishes between factors with substantial and limited impact on loss-to-loss scaling. While our conclusions are inherently shaped by the specific settings we explored, the observed trends provide strong empirical evidence for these distinctions. Given the strong and consistent impact of pretraining data and tokenizer, we can confidently conclude that these interventions affect loss-to-loss scaling. While we observed only a limited impact of the architecture, this effect was also consistent across major state-of-the-art architectures including Llama, GPT, and Mamba, which collectively represent the dominant paradigms in large-scale language modeling. Given this exhaustive set, it is hard to argue that other architectures would meaningfully alter loss-to-loss scaling.

On the Exhaustiveness of Interventions in §4.2: Across the wide range of size configurations (App. A) we test, all models exhibit very consistent loss-to-loss scaling. Similarly, the effect we observed for different context lengths is very consistent within our test range (1024, 2048, 3076), which aligns with commonly used configurations (Black et al., 2021; Wang & Komatsuzaki, 2021; Biderman et al., 2023; Penedo et al., 2024; Black et al., 2022). While we acknowledge the possibility that larger models or longer context lengths could influence loss-to-loss scaling, such an effect, if present, is unlikely. For optimization settings, we again consider configurations widely used in LLM training (Shoeybi et al., 2020; Karpathy, 2022; Videau et al., 2024), including variations in optimizer type, learning rate, weight decay, and scheduling. While our results indicate that these choices do not meaningfully alter loss-to-loss scaling within the explored settings, we acknowledge that the space of optimization techniques is vast, and our list is not exhaustive. It remains possible that a principled optimization strategy, different from current best practices, could induce new scaling behaviors. However, our findings suggest that optimization settings are not a primary driver of loss-to-loss scaling trends within the bounds of conventional language model training.

On the Goodness of Fit of Scaling Laws: For a given model size, we train on a number of tokens up to the Chinchilla-optimal amount (e.g., 8.4 B tokens for a 420 M parameter model) and maintain a constant warmup of 5000 steps, a learning rate of 3e−3, and a one-cycle cosine decay schedule. We use intermediate checkpoints as a proxy for models trained on fewer tokens. While Hoffmann et al. (2022) suggest adapting the scheduler in this case to align with the number of tokens, we are constrained by compute and cannot train thousands of models from scratch. While using intermediate checkpoints alongside models of different sizes may influence the specific shape of the fitted compute-to-loss scaling laws, we find that the overall quality of fit benefits greatly from the additional data points. Since we only use compute-to-loss scaling laws to estimate the entropy terms in Eq. (1) and are interested primarily in the impact of interventions on loss-to-loss scaling, we do not expect this choice to significantly impact our conclusions. We also note that using intermediate checkpoints for fitting scaling laws is not unprecedented in the literature (Schaeffer et al., 2024; Brandfonbrener et al., 2024).
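As a sketch of how such an entropy (irreducible-error) estimate can be obtained, the snippet below fits a saturating compute-to-loss law L(C) = E + A · C^(−α) and reads off E; the checkpoint data is synthetic, and this functional form is an assumption rather than the exact parameterization used in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def compute_to_loss(C, E, A, alpha):
    """Saturating compute-to-loss law; E is the irreducible error on this dataset."""
    return E + A * C ** (-alpha)

# Synthetic checkpoints. A Chinchilla-style allocation of roughly 20 tokens per
# parameter gives the 8.4 B tokens for a 420 M-parameter model mentioned above.
C = np.logspace(17, 21, 200)  # training FLOPs of the checkpoints
loss = compute_to_loss(C, 1.7, 40.0, 0.12) + np.random.default_rng(1).normal(0, 0.01, C.size)

(E_hat, A_hat, alpha_hat), _ = curve_fit(
    compute_to_loss, C, loss, p0=(1.0, 10.0, 0.1), maxfev=20000
)
print(E_hat)  # estimate of E used as E_x|p or E_y|p in Eq. (1)
```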
6. Conclusion

In this work, we systematically investigate loss-to-loss scaling in LLMs, identifying key factors that shape its behavior. Our large-scale interventional analysis, spanning over 6000 model checkpoints across architectures, tokenizers, and training setups, reveals that loss-to-loss scaling consistently follows shifted power-law trends, enabling the prediction of test performance from training loss.

We identify pretraining data and tokenizer as the dominant factors shaping these scaling laws, highlighting the importance of data curation. Architecture has limited impact, with models as different as Llama (transformer-based) and Mamba (a state-space model) exhibiting nearly identical scaling when trained on the same data and tokenizer. Model size, context length, and optimization settings have negligible influence, such that loss-to-loss scaling remains stable across different configurations.

Our findings underline the importance of pretraining data for downstream performance and robustness and suggest that different LLMs might share similar architectural biases. Given our observations, practitioners should prioritize curating high-quality pretraining data to optimize downstream performance, while architectures and training settings can be adjusted freely for efficiency.
Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Author Contributions

This project was co-led and coordinated by PM and TW. TW and PM jointly developed the method, incorporating insights from WB and MB. PM was responsible for training and evaluating the models. SM assisted with certain evaluations and provided some details for Hugging Face models. TW analyzed the data and conducted the scaling law and loss-to-loss fits. Both TW and PM contributed to writing the manuscript, with feedback from WB. TW created all the figures with feedback from PM.

Acknowledgements

We would like to thank (in alphabetical order) Ameya Prabhu, Attila Juhos, Evgenia Rusak, Fanfei Li, Jack Brady, Thomas Klein, and Vishaal Udandarao for helpful discussions and feedback.

This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG.

We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting PM and TW.

References

Awadalla, A., Wortsman, M., Ilharco, G., Min, S., Magnusson, I., Hajishirzi, H., and Schmidt, L. Exploring the landscape of distributional robustness for question answering models, 2022. arXiv:2210.12517.

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling, 2023.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: Reasoning about physical commonsense in natural language, 2019. arXiv:1911.11641.

Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. GPT-Neo: Large scale autoregressive language modeling with Mesh-TensorFlow, 2021.

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. GPT-NeoX-20B: An open-source autoregressive language model, 2022.

Brandfonbrener, D., Anand, N., Vyas, N., Malach, E., and Kakade, S. Loss-to-loss prediction: Scaling laws for all datasets, 2024. arXiv:2411.12925.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018. arXiv:1803.05457.

Dao, T. and Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality, 2024. arXiv:2405.21060.

Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus, 2021. arXiv:2104.08758.

Du, Z., Zeng, A., Dong, Y., and Tang, J. Understanding emergent abilities of language models from the loss perspective, 2025. arXiv:2403.15796.

Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., and Schmidt, L. Data determines distributional robustness in contrastive language image pre-training (CLIP), 2022. arXiv:2205.01397.

Gadre, S. Y., Smyrnis, G., Shankar, V., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., Xin, R., Nezhurina, M., Vasiljevic, I., Jitsev, J., Soldaini, L., Dimakis, A. G., Ilharco, G., Koh, P. W., Song, S., Kollar, T., Carmon, Y., Dave, A., Heckel, R., Muennighoff, N., and Schmidt, L. Language models scale reliably with over-training and on downstream tasks, 2024.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling, 2020. arXiv:2101.00027.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Noac'h, A. L., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 2024.

Gordon, A. S., Kozareva, Z., and Roemmele, M. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011.

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al. The Llama 3 herd of models, 2024.

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., et al. OLMo: Accelerating the science of language models, 2024. arXiv:2402.00838.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces, 2024. arXiv:2312.00752.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2021. arXiv:2009.03300.

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically, 2017.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. arXiv:2203.15556.

Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., Zhang, X., Thai, Z. L., Zhang, K., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. MiniCPM: Unveiling the potential of small language models with scalable training strategies, 2024.

Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models, 2024.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020.

Karpathy, A. nanoGPT, 2022.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts, 2017. arXiv:1608.03983.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019. arXiv:1711.05101.

Madaan, L., Singh, A. K., Schaeffer, R., Poulton, A., Koyejo, S., Stenetorp, P., Narang, S., and Hupkes, D. Quantifying variance in evaluation benchmarks, 2024.

Mayilvahanan, P., Wiedemer, T., Rusak, E., Bethge, M., and Brendel, W. Does CLIP's generalization performance mainly stem from high train-test similarity?, 2024a.

Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., and Brendel, W. In search of forgotten domain generalization, 2024b.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018. arXiv:1809.02789.

Miller, J., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization, 2021.

Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, 2023. arXiv:2306.01116.

Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. arXiv:2406.17557.

Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., and Carmon, Y. Resolving discrepancies in compute-optimal scaling of language models, 2025. arXiv:2406.19146.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019.

Roeder, G., Metz, L., and Kingma, D. P. On linear identifiability of learned representations, 2020.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd Schema Challenge at scale, 2019. arXiv:1907.10641.

Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y. SocialIQA: Commonsense reasoning about social interactions, 2019. arXiv:1904.09728.

Saunshi, N., Karp, S., Krishnan, S., Miryoosefi, S., Reddi, S. J., and Kumar, S. On the inductive bias of stacking towards improving reasoning, 2024. arXiv:2409.19044.

Schaeffer, R., Schoelkopf, H., Miranda, B., Mukobi, G., Madan, V., Ibrahim, A., Bradley, H., Biderman, S., and Koyejo, S. Why has predicting downstream capabilities of frontier AI models with scale remained elusive?, 2024.

Shen, Z., Tao, T., Ma, L., Neiswanger, W., Liu, Z., Wang, H., Tan, B., Hestness, J., Vassilieva, N., Soboleva, D., and Xing, E. SlimPajama-DC: Understanding data combinations for LLM training, 2024. arXiv:2309.10818.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020. arXiv:1909.08053.

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019. arXiv:1811.00937.

Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification, 2020. arXiv:2007.00644.

Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., and Metzler, D. Scaling laws vs model architectures: How does inductive bias influence scaling?, 2022.

Videau, M., Idrissi, B. Y., Haziza, D., Wehrstedt, L., Copet, J., Teytaud, O., and Lopez-Paz, D. Meta Lingua: A minimal PyTorch LLM training library, 2024.

Wang, B. and Komatsuzaki, A. GPT-J-6B: A 6 billion parameter autoregressive language model, 2021.

Wang, S., Chen, Z., Li, B., He, K., Zhang, M., and Wang, J. Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models, 2024.

Wiedemer, T., Sharma, Y., Prabhu, A., Bethge, M., and Brendel, W. Pretraining frequency predicts compositional generalization of CLIP on real-world tasks. In NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward, 2024.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. HuggingFace's Transformers: State-of-the-art natural language processing, 2020.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence?, 2019. arXiv:1905.07830.

Table 1. Details of the models we trained from scratch using the tiktoken tokenizer.

Architecture Width Depth Parameters


Llama 1024 12 416 M
Llama 1024 8 365 M
Llama 1024 4 314 M
Llama 512 12 172 M
Llama 512 8 158 M
Llama 512 4 145 M
Llama 256 12 76 M
Llama 256 8 72 M
Llama 256 4 59 M
Mamba 1024 24 420 M
Mamba 1024 16 367 M
Mamba 1024 8 315 M
Mamba 512 24 172 M
Mamba 512 16 158 M
Mamba 512 8 145 M
Mamba 256 24 76 M
Mamba 256 16 73 M
Mamba 256 8 69 M

A. Model Details
In addition to the models discussed in §4.1, Table 1 lists details of the Llama and Mamba models of different depths and widths that we trained from scratch.

B. Loss-to-Loss Scaling Across Settings


We supplement Fig. 2 from §3 with additional architecture-pretraining pairings in Figs. 10 to 15.

C. Intervention Results for Different Choices of x-Axis


We show variations of Figs. 4 to 6 from §4.1 with C4 validation loss as the x-axis in Figs. 16 to 18. Variations for Figs. 7
to 9 from §4.2 are shown in Figs. 19 to 21.

D. Intervention Results without Averaging


The causal analysis in §4 was performed on scaling laws for average validation or test loss. Figs. 22 to 27 show illustrative
results on scaling laws for individual datasets.

E. Additional Train-to-Test Scaling Laws


We provide train-to-test scaling laws for the interventions performed in §4.2 and Figs. 7 to 9 in Figs. 28 to 33.

Figure 10. Loss-to-Loss Scaling for FineWeb-Edu-trained Llama.

Figure 11. Loss-to-Loss Scaling for C4-trained Llama.

Figure 12. Loss-to-Loss Scaling for The Pile-trained Llama.

Figure 13. Loss-to-Loss Scaling for FineWeb-Edu-trained Mamba.

Figure 14. Loss-to-Loss Scaling for C4-trained Mamba.

Figure 15. Loss-to-Loss Scaling for The Pile-trained Mamba.

Figure 16. Pretraining data has a substantial impact on loss-to-loss scaling laws.

Figure 17. The tokenizer has a moderate impact on loss-to-loss scaling laws.

Figure 18. Architecture has limited impact on loss-to-loss scaling laws.

Figure 19. Model size does not affect train-to-test scaling.

Figure 20. Context length does not affect train-to-test scaling.

Figure 21. Optimizer settings do not affect train-to-test scaling.

Figure 22. Pretraining data has a substantial impact on loss-to-loss scaling laws.

Figure 23. Pretraining data has a substantial impact on loss-to-loss scaling laws.

Figure 24. The tokenizer has a moderate impact on loss-to-loss scaling laws.

Figure 25. The tokenizer has a moderate impact on loss-to-loss scaling laws.

Figure 26. Architecture has limited impact on loss-to-loss scaling laws.

Figure 27. Architecture has limited impact on loss-to-loss scaling laws.

Figure 28. Model size does not affect train-to-test scaling.

Figure 29. Context length does not affect train-to-test scaling.

Figure 30. Optimizer settings do not affect train-to-test scaling.

Figure 31. Model size does not affect train-to-test scaling.

Figure 32. Context length does not affect train-to-test scaling.

Figure 33. Optimizer settings do not affect train-to-test scaling.