Tag: logarithmic
Logarithmic-time Schedules for Scaling Language Models with Momentum
arXiv:2602.05298v1. Abstract: In practice, the hyperparameters $(\beta_1, \beta_2)$ and weight decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of…