Summary:

N: number of model parameters (excluding embeddings)

D: dataset size (in tokens)

C: compute used for training, approximately $6NBS$ (where B is the batch size and S is the number of training steps)
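The $C \approx 6NBS$ estimate comes from counting roughly 6 FLOPs per parameter per training token. A minimal sketch (the model size, batch size, and step count below are hypothetical values chosen for illustration, not figures from the source):

```python
def training_compute(n_params: float, batch_tokens: float, steps: float) -> float:
    """Approximate training compute in FLOPs: C ≈ 6 * N * B * S,
    i.e. ~6 FLOPs per non-embedding parameter per processed token."""
    return 6 * n_params * batch_tokens * steps

# Hypothetical run: 125M-parameter model, 0.5M-token batches, 100k steps
c = training_compute(125e6, 0.5e6, 1e5)  # ≈ 3.75e19 FLOPs
```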

L: cross-entropy loss

B_crit: critical batch size (measured in tokens)

C_min: minimum amount of non-embedding compute needed to reach a given loss

S_min: minimum number of training steps needed to reach a given loss; this is also the number of steps that would be used if the model were trained at a batch size much larger than the critical batch size.

$\alpha_X$: power-law exponents, $L(X) \propto 1/X^{\alpha_X}$
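The power-law form means each 10x increase in X shrinks the loss by a fixed factor of $10^{-\alpha_X}$. A small sketch for $X = N$; the exponent and scale constant below are assumptions in the ballpark of the fits reported for L(N), used here only for illustration:

```python
ALPHA_N = 0.076   # assumed power-law exponent for parameter count
N_C = 8.8e13      # assumed scale constant (non-embedding parameters)

def loss_from_params(n: float) -> float:
    """Predicted cross-entropy loss as a function of (non-embedding)
    parameter count: L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n) ** ALPHA_N

# A 10x larger model reduces loss by the constant factor 10**(-alpha_N)
ratio = loss_from_params(1e9) / loss_from_params(1e8)
```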
