- Loss (cross-entropy) scales as a power law w.r.t. model size, dataset size, and compute; architectural details such as depth vs. width have minimal effect. Simple equations govern the dependence of overfitting on model/dataset size.
Summary:
N: no. of model parameters (excluding embeddings)
D: dataset size (in tokens)
C: compute used for training; C ≈ 6NBS, where B is the batch size and S the number of training steps
L: cross-entropy loss
B_crit: critical batch size (measured in tokens)
C_min: minimum amount of non-embedding compute needed to reach a given loss
S_min: minimum no. of training steps needed to reach a given loss; this is the number of steps used when training at a batch size much greater than B_crit
$\alpha_X$: power-law exponents, $L \propto 1/X^{\alpha_X}$ for $X \in \{N, D, C\}$
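The power-law form and the C ≈ 6NBS compute estimate can be sketched as below. The fitted constants (N_c ≈ 8.8e13 non-embedding parameters, α_N ≈ 0.076) are approximate values from the paper's model-size law; treat them as assumptions.

```python
def loss_from_model_size(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, with D and C effectively unbounded.
    Constants are the paper's approximate fit, assumed here."""
    return (n_c / n_params) ** alpha_n

def training_compute(n_params, batch_size, steps):
    """C ~= 6 * N * B * S (roughly 6N FLOPs per training token)."""
    return 6 * n_params * batch_size * steps

# Doubling N shrinks the loss by a constant factor of 2^alpha_N,
# regardless of the starting size -- the defining property of a power law.
ratio = loss_from_model_size(1e9) / loss_from_model_size(2e9)
```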
- performance depends strongly on scale (N, D, and C) and only weakly on model shape (depth vs. width)
- overfitting occurs if either N or D is held fixed while the other increases; performance improves as long as both are scaled up together. An 8x increase in model size needs roughly a 5x increase in data; the overfitting penalty depends on the ratio N^0.74/D
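The 8x → 5x rule is just arithmetic on the N^0.74/D ratio: holding the ratio fixed means D must grow as N^0.74.

```python
# To keep the overfitting penalty N^0.74 / D constant, scaling N by 8x
# requires scaling D by 8^0.74, which is roughly the quoted 5x.
data_factor = 8 ** 0.74
print(round(data_factor, 2))  # ≈ 4.66
```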
- large models are more sample-efficient than small models, reaching the same loss in fewer optimization steps and with fewer tokens
- convergence is compute-inefficient for large models: compute-efficient training stops well short of convergence, with data requirements growing only as $D \sim C^{0.27}$
- the critical batch size is roughly a power of the loss alone, $B_\mathrm{crit} \propto L^{-1/\alpha_B}$, independent of model size
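A minimal sketch of the batch-size rule, assuming the paper's approximate fit B_crit(L) ≈ B*/L^(1/α_B) with B* ≈ 2e8 tokens and α_B ≈ 0.21 (both constants are assumptions from the fit, not exact values):

```python
def critical_batch_size(loss, b_star=2e8, alpha_b=0.21):
    """B_crit(L) ~= B* / L^(1/alpha_B): depends on the loss only,
    so B_crit grows as training drives the loss down."""
    return b_star / loss ** (1 / alpha_b)
```

Note the direction: a lower loss implies a larger critical batch size, so data parallelism becomes more useful later in training.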
- as the compute budget grows, most of it should be allocated to increasing model size, with comparatively small increases in data and training steps
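The allocation above can be made concrete with the paper's approximate compute-optimal exponents, N ∝ C^0.73, B ∝ C^0.24, S ∝ C^0.03 (so D = B·S ∝ C^0.27, matching the earlier bullet); the exponent values are assumptions taken from the fit, and they sum to 1 because C ≈ 6NBS.

```python
def scale_factors(compute_factor):
    """Given a k-fold compute increase, return the approximate compute-optimal
    scaling of model size, batch size, and steps (exponents assumed from
    the paper's fit: 0.73 + 0.24 + 0.03 = 1, consistent with C = 6NBS)."""
    return {
        "model_size": compute_factor ** 0.73,
        "batch_size": compute_factor ** 0.24,
        "steps": compute_factor ** 0.03,
    }

f = scale_factors(10.0)  # a 10x compute budget
# Model size grows the most; serial steps barely grow at all.
```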
