Notes on Scaling Laws and Generalisation
Kaplan et al.'s scaling laws established that language model loss falls predictably, as an approximate power law, with compute, dataset size, and parameter count. But a predictable loss does not imply predictable capabilities.
The Smoothness Assumption
Scaling laws assume that loss follows a smooth power law in scale. Yet capability emergence (sudden jumps in performance on specific tasks) suggests that the loss curve and the capability curve need not have the same shape.
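As a toy illustration of how a smooth loss curve can coexist with an abrupt-looking capability curve, the sketch below pairs a power-law loss with an exact-match metric over a multi-token answer. Everything in it is an assumption made for illustration: the constants `N_C` and `ALPHA` are hypothetical (not fitted values from Kaplan et al.), and treating `exp(-loss)` as a per-token success probability that multiplies across `answer_len` tokens is one simple mechanism among many, not a claim about how emergence arises in practice.

```python
import math

# Hypothetical constants chosen so the effect is visible; not fitted to any real model.
N_C = 1e6      # scale at which the toy loss equals 1 nat per token
ALPHA = 0.3    # toy power-law exponent


def loss(n_params: float) -> float:
    """Smooth power law: L(N) = (N_C / N) ** ALPHA."""
    return (N_C / n_params) ** ALPHA


def exact_match(n_params: float, answer_len: int = 30) -> float:
    """Toy capability metric: probability of getting all answer_len tokens right,
    assuming each token succeeds independently with probability exp(-loss)."""
    return math.exp(-loss(n_params) * answer_len)


if __name__ == "__main__":
    # Sweep parameter counts from 1e6 to 1e12, one decade at a time.
    for exponent in range(6, 13):
        n = 10.0 ** exponent
        print(f"N=1e{exponent:<2d}  loss={loss(n):.4f}  exact_match={exact_match(n):.4f}")
```

Each tenfold increase in N cuts the toy loss by the same factor, yet the exact-match score stays near zero for the smallest models and only becomes visible after several decades of scale.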
This matters for research planning. If we extrapolate from loss curves alone, we may miss discontinuities that define what models can and cannot do.
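To make the extrapolation risk concrete, here is a minimal sketch on synthetic data. The same straight-line fit is applied to loss (in log-log space) and to a task accuracy, both observed only at small scale. The functions `true_loss` and `true_accuracy` are hypothetical stand-ins invented for this example, with the accuracy deliberately built as a logistic step around 1e10 parameters; no real measurements are involved.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_loss(n_params):
    """Hypothetical smooth power-law loss."""
    return (1e6 / n_params) ** 0.3


def true_accuracy(n_params):
    """Hypothetical task accuracy with a sharp transition near 1e10 parameters."""
    return 1.0 / (1.0 + math.exp(-3.0 * (math.log10(n_params) - 10.0)))


small_scales = [1e6, 1e7, 1e8]   # the only runs we pretend to have observed
target = 1e11                    # the scale we want to predict
log_n = [math.log10(n) for n in small_scales]

# Loss: fit a line to log10(loss) vs log10(N), then extrapolate.
m, c = fit_line(log_n, [math.log10(true_loss(n)) for n in small_scales])
predicted_loss = 10 ** (m * math.log10(target) + c)

# Accuracy: fit a line to accuracy vs log10(N), then extrapolate the same way.
m_acc, c_acc = fit_line(log_n, [true_accuracy(n) for n in small_scales])
predicted_accuracy = m_acc * math.log10(target) + c_acc

print(f"loss:     predicted={predicted_loss:.4f}  actual={true_loss(target):.4f}")
print(f"accuracy: predicted={predicted_accuracy:.4f}  actual={true_accuracy(target):.4f}")
```

In this synthetic setup the loss extrapolation lands on the true value, while the accuracy extrapolation stays close to zero even though the synthetic accuracy at the target scale is above 0.9. The small-scale runs simply contain no signal about where the transition sits.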
A Working Hypothesis
Generalisation may depend less on scale alone and more on the structure of the training distribution relative to the structure of the task. Two models with identical loss on the training distribution can differ radically in out-of-distribution behaviour.
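A deliberately tiny example of this hypothesis, offered as an illustration rather than evidence: the task is to reproduce y = x, but the training distribution only covers non-negative inputs. The two models below (`model_identity` and `model_abs`, both hypothetical) are indistinguishable by loss on that distribution and disagree once inputs turn negative.

```python
def model_identity(x: float) -> float:
    """Returns the input unchanged; matches the task everywhere."""
    return x


def model_abs(x: float) -> float:
    """Returns |x|; matches the task only for non-negative inputs."""
    return abs(x)


def mse(model, inputs):
    """Mean squared error against the task's target function y = x."""
    return sum((model(x) - x) ** 2 for x in inputs) / len(inputs)


in_dist = [i / 10 for i in range(0, 11)]      # 0.0 .. 1.0, what training covers
out_dist = [-i / 10 for i in range(1, 11)]    # -0.1 .. -1.0, never seen in training

for name, model in [("identity", model_identity), ("abs", model_abs)]:
    print(f"{name:8s}  in-dist MSE={mse(model, in_dist):.3f}  OOD MSE={mse(model, out_dist):.3f}")
```

The training distribution here contains no input that separates the two functions, so loss alone cannot tell them apart. Only the structure of the task, which includes negative inputs, does.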
This note is intentionally incomplete. It will grow as the thinking develops.