TLDR: This research introduces a novel weight decay scaling rule for the AdamW optimizer that significantly improves the transferability of hyperparameters across model widths in large language models. By keeping the “sublayer gain” consistent, the method allows “zero-shot transfer” of learning rates and weight decays from smaller “proxy” models to much larger “target” models, moving beyond previous methods that only addressed early-stage training dynamics.
Training large language models efficiently is a complex task, often requiring extensive tuning of hyperparameters such as the learning rate and weight decay. While existing methods like Maximal Update Parametrization (µP) provide valuable guidance for scaling models, they primarily focus on the initial stages of training. This new research, titled Robust Layerwise Scaling Rules by Proper Weight Decay Tuning, delves into the later, steady-state phase of training, offering a crucial advancement for optimizing large-scale models.
The paper, authored by Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, and Quanquan Gu, addresses a significant challenge: as modern neural networks, especially those with normalization layers, train for longer, they enter a ‘steady state’ where the optimizer dictates the dynamics. In this regime, the effective learning rate can become dependent on the model’s width, undermining the benefits of µP and making it difficult to transfer optimal hyperparameters from smaller ‘proxy’ models to larger ‘target’ models without re-tuning.
The Problem with Current Scaling
Traditional µP helps ensure that the magnitude of parameter updates remains consistent as a model widens. However, normalization layers, while beneficial for training stability, introduce a ‘backward scale sensitivity’: even if µP holds at initialization, the gradients can become width-dependent in the steady state, causing the effective learning rate to vary with model width. This necessitates costly per-width hyperparameter sweeps, slowing down the development of larger models.
A Novel Weight Decay Scaling Rule
The core contribution of this research is a specific weight decay scaling rule for the AdamW optimizer. The authors observed that in the steady state of AdamW training, the singular values of weight matrices (which describe how much a matrix stretches or shrinks vectors) scale proportionally to the square root of the ratio of learning rate to weight decay, √(η/λ). Crucially, the *shape* of this spectrum remains largely invariant.
Building on this, they found that for matrix-like parameters (such as those in attention and feed-forward layers), the top singular value scales approximately as √(η/λ) · d^0.75, where d is the model width. To keep the ‘sublayer gain’ (the ratio of output to input scale for a layer) consistent across model widths, they combined this observation with the existing µP learning rate rule for matrix-like parameters (η₂ ∝ d⁻¹): substituting η ∝ d⁻¹ into √(η/λ) · d^0.75, the width dependence cancels exactly when λ grows like √d. This yields the new empirical weight decay scaling rule λ₂ ∝ √d.
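As a quick sanity check of this arithmetic, the snippet below (my own illustration, not code from the paper) evaluates the predicted top singular value √(η/λ) · d^0.75 under η₂ ∝ d⁻¹ and λ₂ ∝ √d for several widths; the base learning rate and weight decay are arbitrary placeholders.

```python
# Numeric check that eta2 ∝ 1/d and lambda2 ∝ sqrt(d) make the predicted
# top singular value sqrt(eta / lambda) * d**0.75 independent of width d.
base_lr, base_wd = 3e-3, 0.1  # arbitrary placeholder values

for d in (256, 1024, 4096, 16384):
    eta = base_lr / d          # µP learning-rate rule for matrix-like params
    lam = base_wd * d ** 0.5   # proposed weight-decay rule
    sigma_top = (eta / lam) ** 0.5 * d ** 0.75
    print(f"d={d:6d}  predicted top singular value ∝ {sigma_top:.4f}")
```

Every width prints the same value, which is exactly the width invariance the rule is designed to achieve.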
The Proposed Layerwise Scaling Rules
The research proposes a comprehensive set of scaling rules:
- Vector-like parameters (e.g., embeddings, LayerNorm gains, biases): use a width-independent learning rate (η₁ = Θ(1)) and zero weight decay (λ₁ = 0), so these hyperparameters do not change as the model widens.
- Matrix-like parameters (e.g., dense projections in attention and FFN blocks): use a learning rate that scales inversely with model width (η₂ ∝ d⁻¹) and a weight decay that scales with the square root of model width (λ₂ ∝ √d), as sketched in the example below.
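Below is a minimal PyTorch-style sketch of how these groups could be wired into AdamW. It is my own illustration under simplifying assumptions, not the authors' implementation: a single base learning rate is shared by both groups, base_lr and base_wd are assumed to have been tuned at a proxy width d0, and the ndim-based split is crude (embedding tables are 2-D tensors yet are treated as vector-like in the paper, so a real setup would filter them by name).

```python
import torch

def build_adamw(model, d, d0, base_lr, base_wd):
    """Build AdamW with the layerwise rules, transferring hyperparameters
    tuned at proxy width d0 to target width d (illustrative sketch)."""
    # Crude split: 1-D tensors (LayerNorm gains, biases) are treated as
    # vector-like, 2-D tensors as matrix-like.
    vector_like = [p for p in model.parameters() if p.ndim <= 1]
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    return torch.optim.AdamW([
        # Vector-like: width-independent learning rate, no weight decay.
        {"params": vector_like, "lr": base_lr, "weight_decay": 0.0},
        # Matrix-like: η₂ ∝ 1/d, λ₂ ∝ √d.
        {"params": matrix_like,
         "lr": base_lr * d0 / d,
         "weight_decay": base_wd * (d / d0) ** 0.5},
    ])
```

With d = d0 this reduces to the proxy configuration, and scaling d up applies the transfer rules automatically.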
This new scheme enables ‘zero-shot transfer’ of both learning rate and weight decay. This means that hyperparameters tuned on a smaller ‘proxy’ model can be directly applied to a much larger ‘target’ model without needing additional sweeps, significantly streamlining the training process for large models.
Validation and Practical Implications
The authors validated their rules on LLaMA-style Transformer models and in a minimal synthetic setting, demonstrating that the proposed scaling maintains sublayer-gain invariance across model widths. They also provide a simple diagnostic for verifying this invariance: checking that the top singular values of corresponding weight matrices match across widths.
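A rough sketch of that diagnostic is shown below; the checkpoint paths and the 2-D filter are placeholders of mine rather than the authors' tooling. Under the proposed rules, the top singular values of corresponding matrix-like parameters should roughly match between the proxy-width and target-width checkpoints after steady-state training.

```python
import torch

def top_singular_values(state_dict):
    """Largest singular value of each 2-D (matrix-like) parameter."""
    return {
        name: torch.linalg.svdvals(w.float())[0].item()
        for name, w in state_dict.items()
        if w.ndim == 2
    }

proxy = top_singular_values(torch.load("proxy_width_ckpt.pt"))    # placeholder path
target = top_singular_values(torch.load("target_width_ckpt.pt"))  # placeholder path
for name in sorted(proxy.keys() & target.keys()):
    print(f"{name:40s}  proxy={proxy[name]:.3f}  target={target[name]:.3f}")
```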
An interesting finding is the strong correlation between learning rate and weight decay. The research shows that there’s a ‘diagonal ridge’ of near-optimal configurations, implying that increasing one often requires decreasing the other to maintain performance. This suggests that a full two-dimensional hyperparameter search might not always be necessary; one could heuristically select a weight decay and then perform a one-dimensional sweep over the learning rate, transferring the results using the new scaling rules.
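As an illustration of that workflow (my paraphrase, with a hypothetical train_and_eval_proxy helper standing in for an actual proxy training run), one could fix the weight decay, sweep only the learning rate on the proxy, and then transfer the winner to the target width with the scaling rules:

```python
def train_and_eval_proxy(lr, wd):
    """Hypothetical stand-in: train the proxy model with (lr, wd) and
    return its validation loss."""
    raise NotImplementedError  # replace with a real proxy training run

d0, d = 512, 8192        # proxy and target widths (assumed for illustration)
fixed_wd = 0.1           # heuristically chosen weight decay for the proxy
lr_grid = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]

# One-dimensional sweep over the learning rate at fixed weight decay.
best_lr = min(lr_grid, key=lambda lr: train_and_eval_proxy(lr, fixed_wd))

# Zero-shot transfer of the matrix-like hyperparameters to the target width.
target_lr = best_lr * d0 / d
target_wd = fixed_wd * (d / d0) ** 0.5
```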
Looking Ahead
This work extends the applicability of µP beyond the early-initialization phase, providing a practical recipe for robust hyperparameter transfer under AdamW. While the current findings are specific to AdamW and LLaMA-style architectures, the methodology of inspecting sublayer gain via singular value spectra offers a transferable procedure for future research into other optimizers, architectures, and scaling regimes.