#import "@preview/cetz:0.1.2"
#import "@preview/plotst:0.1.0"
#import "@preview/diagraph:0.1.0"
#import "@preview/tablex:0.0.5": tablex, cellx, rowspanx
#set page(
  numbering: "1",
  number-align: center,
  header: align(right)[AI Scaling Laws and Model Efficiency],
)
= AI Scaling and Limitations
== Introduction
The remarkable progress in artificial intelligence over the past decade has been
largely driven by an unprecedented increase in computational scale. Modern large
language models (LLMs) and multimodal systems with trillions of parameters trained
on vast datasets have demonstrated capabilities that were once thought to be the
exclusive domain of human intelligence. Behind this explosion in capability lies a
fascinating empirical phenomenon: scaling laws that govern the relationship between
model size, dataset size, compute resources, and ultimate performance. This essay
explores these scaling relationships, their theoretical foundations, empirical
validation, and the technical frontiers in improving model efficiency beyond simply
scaling up computation.
The foundational work on neural network scaling laws revealed that model
performance follows a power law relationship with respect to key factors:
$
L(N, D, C) approx N^(-alpha_N) + D^(-alpha_D) + C^(-alpha_C)
$
Where:
- $L$ is the loss (lower is better)
- $N$ is the number of parameters
- $D$ is the dataset size
- $C$ is the compute budget
- $alpha_N$, $alpha_D$, and $alpha_C$ are scaling exponents
These exponents typically range from 0.05 to 0.5, depending on the specific
architecture and task domain. The power law relationship suggests that performance
improvements from scaling follow a pattern of diminishing returns, yet remain
predictable.
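As a rough numerical illustration, the sketch below evaluates this additive power law for a hypothetical model. The exponent values are borrowed from the dense-transformer row of the table later in this section; the absolute loss numbers are schematic, since the formula above omits architecture-specific coefficients and any irreducible-loss term.

```python
# Sketch of the additive power-law loss above. The exponents are taken from
# the dense-transformer row of the table in this section; coefficients and
# absolute values are schematic placeholders, not fitted results.

def power_law_loss(n_params, n_tokens, compute,
                   alpha_n=0.076, alpha_d=0.095, alpha_c=0.220):
    """Approximate L(N, D, C) as a sum of power-law terms."""
    return n_params**-alpha_n + n_tokens**-alpha_d + compute**-alpha_c

# Doubling the parameter count shrinks the N term by a factor of 2**-alpha_n:
# a predictable but diminishing improvement.
print(power_law_loss(1e9, 2e10, 1e20))   # 1B parameters
print(power_law_loss(2e9, 2e10, 1e20))   # 2B parameters, same data and compute
```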
The Chinchilla scaling laws, proposed by Hoffmann et al. (2022), revised earlier
work by suggesting that models had been significantly undertrained. They proposed
an optimal allocation between model size and training tokens:
$
N_"optimal" prop C^(0.5)
$
$
D_"optimal" prop C^(0.5)
$
This implies that compute should be split roughly equally between increasing model
size and increasing training data—a departure from previous practice that favored
larger models over more extensive training.
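A minimal sketch of this allocation rule follows. It assumes the common approximation that training compute is roughly $C approx 6 N D$ FLOPs and that the compute-optimal ratio is about 20 training tokens per parameter; neither constant appears above, and both are assumptions used only to make the $C^(0.5)$ scaling concrete.

```python
import math

# Sketch of a Chinchilla-style compute split. Assumptions (not stated in the
# text): training FLOPs C ~ 6 * N * D, and a compute-optimal ratio of about
# 20 training tokens per parameter.

def compute_optimal_allocation(compute_flops, flops_per_param_token=6.0,
                               tokens_per_param=20.0):
    """Return (N, D) with N * D = C / 6 and D / N fixed, so both scale as C**0.5."""
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Each doubling of compute grows both N and D by about sqrt(2), i.e. ~41%.
for c in (1e21, 1e23, 5.76e23):
    n, d = compute_optimal_allocation(c)
    print(f"C = {c:.2e} FLOPs -> N = {n:.2e} params, D = {d:.2e} tokens")
```

Under these assumptions, a budget of about $5.8 times 10^23$ FLOPs yields roughly 70B parameters and 1.4T tokens, the regime reported for Chinchilla itself.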
Recent research has extended scaling laws across different architectural families:
#let scaling_data = (
  ("Transformer (Dense)", "0.076", "0.095", "0.220"),
  ("Transformer (Sparse MoE)", "0.099", "0.091", "0.249"),
  ("CNN", "0.068", "0.087", "0.198"),
  ("State Space Models", "0.084", "0.093", "0.231"),
  ("Recurrent Neural Networks", "0.059", "0.083", "0.180"),
)
#figure(
  block(width: 100%)[
    #tablex(
      columns: (1fr, 1fr, 1fr, 1fr),
      align: (x, y) => (left, center, center, center).at(x),
      cellx(fill: gray.lighten(80%))[*Architecture Family*],
      cellx(fill: gray.lighten(80%))[*Parameter Scaling ($alpha_N$)*],
      cellx(fill: gray.lighten(80%))[*Data Scaling ($alpha_D$)*],
      cellx(fill: gray.lighten(80%))[*Compute Scaling ($alpha_C$)*],
      ..scaling_data.flatten()
    )
  ],
  caption: [Empirical scaling exponents across neural network architectures]
)
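Exponents like those in the table are typically estimated by fitting a straight line in log-log space, since $L approx a N^(-alpha_N)$ implies that $log L$ is linear in $log N$ with slope $-alpha_N$. The sketch below illustrates that procedure on synthetic data; the data points and the least-squares fit are illustrative assumptions, not the methodology of any particular study discussed here.

```python
import numpy as np

# Illustrative estimate of a parameter-scaling exponent alpha_N.
# The "measurements" are synthetic, generated from a known exponent,
# purely to demonstrate the log-log fitting procedure.

true_alpha = 0.076
model_sizes = np.array([1e7, 1e8, 1e9, 1e10])                # parameters N
losses = 3.0 * model_sizes ** (-true_alpha)                   # L = a * N^(-alpha)
losses *= 1.0 + 0.01 * np.random.default_rng(0).standard_normal(model_sizes.size)

# log L = log a - alpha_N * log N, so the fitted slope gives -alpha_N.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
print(f"estimated alpha_N = {-slope:.3f}")                    # close to 0.076
```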
Scaling has also been associated with qualitatively new capabilities that are not captured by loss curves alone. Examples include:
- In-context learning
- Multi-step reasoning
- Zero-shot instruction following
- Tool use