- Loss (cross-entropy) scales as a power law w.r.t. model size, dataset size, and compute; architectural details such as depth vs. width have minimal effect. Simple equations govern the dependence of overfitting on model/dataset size.
Summary:
N: no. of model parameters (excluding embeddings)
D: dataset size (in tokens)
C: compute used for training; C ≈ 6NBS, where B is the batch size and S the number of training steps
L: cross-entropy loss
B_crit: critical batch size (measured in tokens)
C_min: minimum amount of non-embedding compute needed to reach a given loss
S_min: minimum no. of training steps needed to reach a given loss; this is the number of steps used when training at a batch size much greater than B_crit
$\alpha_X$: power-law exponents, $L \propto 1/X^{\alpha_X}$ for $X \in \{N, D, C\}$
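The power-law form and the C ≈ 6NBS compute estimate can be sketched as below. The fitted constants (N_c ≈ 8.8e13 non-embedding parameters, α_N ≈ 0.076) are approximate values from the paper's model-size law; treat them as assumptions.

```python
def loss_from_model_size(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, with D and C effectively unbounded.
    Constants are the paper's approximate fit, assumed here."""
    return (n_c / n_params) ** alpha_n

def training_compute(n_params, batch_size, steps):
    """C ~= 6 * N * B * S (roughly 6N FLOPs per training token)."""
    return 6 * n_params * batch_size * steps

# Doubling N shrinks the loss by a constant factor of 2^alpha_N,
# regardless of the starting size -- the defining property of a power law.
ratio = loss_from_model_size(1e9) / loss_from_model_size(2e9)
```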
- performance depends strongly on scale (N, D, and C) and only weakly on model shape (depth vs. width)
- overfitting occurs if either N or D is held fixed while the other increases; performance improves as long as both are scaled up together. An 8x increase in model size needs roughly a 5x increase in data; the overfitting penalty depends on the ratio N^0.74/D
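The 8x → 5x rule is just arithmetic on the N^0.74/D ratio: holding the ratio fixed means D must grow as N^0.74.

```python
# To keep the overfitting penalty N^0.74 / D constant, scaling N by 8x
# requires scaling D by 8^0.74, which is roughly the quoted 5x.
data_factor = 8 ** 0.74
print(round(data_factor, 2))  # ≈ 4.66
```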
- large models are more sample-efficient than small models, reaching the same loss in fewer optimization steps and with fewer tokens
- convergence is compute-inefficient for large models: compute-efficient training stops well short of convergence, with data requirements growing only as $D \sim C^{0.27}$
- the critical batch size is roughly a power of the loss alone, $B_\mathrm{crit} \propto L^{-1/\alpha_B}$, independent of model size
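A minimal sketch of the batch-size rule, assuming the paper's approximate fit B_crit(L) ≈ B*/L^(1/α_B) with B* ≈ 2e8 tokens and α_B ≈ 0.21 (both constants are assumptions from the fit, not exact values):

```python
def critical_batch_size(loss, b_star=2e8, alpha_b=0.21):
    """B_crit(L) ~= B* / L^(1/alpha_B): depends on the loss only,
    so B_crit grows as training drives the loss down."""
    return b_star / loss ** (1 / alpha_b)
```

Note the direction: a lower loss implies a larger critical batch size, so data parallelism becomes more useful later in training.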
- as the compute budget grows, most of it should be allocated to increasing model size, with comparatively small increases in data and training steps
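The allocation above can be made concrete with the paper's approximate compute-optimal exponents, N ∝ C^0.73, B ∝ C^0.24, S ∝ C^0.03 (so D = B·S ∝ C^0.27, matching the earlier bullet); the exponent values are assumptions taken from the fit, and they sum to 1 because C ≈ 6NBS.

```python
def scale_factors(compute_factor):
    """Given a k-fold compute increase, return the approximate compute-optimal
    scaling of model size, batch size, and steps (exponents assumed from
    the paper's fit: 0.73 + 0.24 + 0.03 = 1, consistent with C = 6NBS)."""
    return {
        "model_size": compute_factor ** 0.73,
        "batch_size": compute_factor ** 0.24,
        "steps": compute_factor ** 0.03,
    }

f = scale_factors(10.0)  # a 10x compute budget
# Model size grows the most; serial steps barely grow at all.
```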
