Adafactor

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Revision as of 00:08, 11 December 2024

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function $ f(x) $, where $ x \in \mathbb{R}^n $ is the weight vector to be optimized.

2. Parameters

  • Gradient:

$ G_t = \nabla f(x_{t-1}) $

  • Second moment estimate:

$ \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n) $

  • Where:
    • $ \hat{V}_t $ is the running average of the squared gradient.
    • $ \hat{\beta}_{2t} $ is the corrected decay parameter.
    • $ \epsilon_1 $ is a regularization constant.
  • Step size:

$ \alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t $

  • Where:
    • $ \rho_t $ is the relative step size.
    • $ \epsilon_2 $ is a regularization constant.
    • $ \text{RMS} $ is the root mean square. For the unscaled update $ u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}} $ of each parameter $ x $, it is defined as:
      • $ \text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)} $
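The RMS quantity above is just the root mean square of the elementwise-normalized gradient. A minimal sketch (the array values are hypothetical, chosen for illustration):

```python
import numpy as np

def rms(x):
    """Root mean square of a tensor: sqrt(mean(x^2))."""
    return np.sqrt(np.mean(np.square(x)))

# Hypothetical gradient G_t and second-moment estimate V_t
g = np.array([0.3, -0.4, 1.2])
v = np.array([0.09, 0.16, 1.44])

u = g / np.sqrt(v)   # normalized gradient U_t, elementwise
print(rms(u))        # here each |u_i| = 1, so RMS(U_t) = 1.0
```

When the second moment tracks the squared gradient well, each entry of $ U_t $ has magnitude near 1, so RMS$(U_t)$ hovers around 1; values far above 1 signal the instability that clipping guards against.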

3. Algorithms

Adafactor for Weighted Vectors

Inputs:

  • Initial point: $ X_0 \in \mathbb{R}^n $
  • Relative step sizes: $ \rho_t $ for $ t = 1 $ to $ T $
  • Second moment decay: $ \hat{\beta}_{2t} $ for $ t = 1 $ to $ T $, with $ \hat{\beta}_{21} = 0 $
  • Regularization constants: $ \epsilon_1, \epsilon_2 $
  • Clipping threshold: $ d $

Algorithm:

  • For $ t = 1 $ to $ T $:
    • Compute adaptive step size: $ \alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t $
    • Compute gradient: $ G_t = \nabla f_t(X_{t-1}) $
    • Update second moment estimate: $ \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n) $
    • Compute normalized gradient: $ U_t = \frac{G_t}{\sqrt{\hat{V}_t}} $
    • Apply clipping: $ \hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)} $
    • Update parameter: $ X_t = X_{t-1} - \alpha_t \hat{U}_t $
  • End for
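The vector algorithm above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: `grad_fn` is a hypothetical callable returning $ G_t $, and the schedules for $ \rho_t $ and $ \hat{\beta}_{2t} $ follow the proposed hyperparameters in Section 4.

```python
import numpy as np

def adafactor_vector(grad_fn, x0, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight vector, following the steps above."""
    x = x0.astype(float).copy()
    v = np.zeros_like(x)                                  # second moment V_t
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))                 # relative step size
        beta2 = 1.0 - t ** (-0.8)                         # decay; beta_{2,1} = 0
        alpha = max(eps2, np.sqrt(np.mean(x ** 2))) * rho # adaptive step size
        g = grad_fn(x)                                    # gradient G_t
        v = beta2 * v + (1.0 - beta2) * (g ** 2 + eps1)   # second moment update
        u = g / np.sqrt(v)                                # normalized gradient
        u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)  # clipping
        x = x - alpha * u_hat                             # parameter update
    return x

# Usage: minimize f(x) = ||x||^2 / 2, whose gradient is x itself
x_final = adafactor_vector(lambda x: x, np.array([5.0, -3.0, 2.0]), T=200)
```

Because the step size is relative to RMS$(x)$, the iterates shrink roughly geometrically toward the minimizer at the origin.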

Adafactor for Weighted Matrices

Inputs:

  • Initial point: $ X_0 \in \mathbb{R}^{n \times m} $
  • Relative step sizes: $ \rho_t $ for $ t = 1 $ to $ T $
  • Second moment decay: $ \hat{\beta}_{2t} $ for $ t = 1 $ to $ T $, with $ \hat{\beta}_{21} = 0 $
  • Regularization constants: $ \epsilon_1, \epsilon_2 $
  • Clipping threshold: $ d $

Algorithm:

  • For $ t = 1 $ to $ T $:
    • Compute adaptive step size: $ \alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t $
    • Compute gradient: $ G_t = \nabla f_t(X_{t-1}) $
    • Update row-wise second moment: $ R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m $
    • Update column-wise second moment: $ C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T) $
    • Update overall second moment estimate: $ \hat{V}_t = \frac{R_t C_t}{1_n^T R_t} $
    • Compute normalized gradient: $ U_t = \frac{G_t}{\sqrt{\hat{V}_t}} $
    • Apply clipping: $ \hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)} $
    • Update parameter: $ X_t = X_{t-1} - \alpha_t \hat{U}_t $
  • End for
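The distinctive step in the matrix variant is the factored second moment: only the row accumulator $ R_t $ and column accumulator $ C_t $ are stored, and $ \hat{V}_t $ is reconstructed from their outer product. A minimal sketch of that single step (the test matrix is hypothetical, chosen so $ G_t^2 $ is exactly rank one):

```python
import numpy as np

def factored_second_moment_step(R, C, G, beta2, eps1=1e-30):
    """One update of the factored second moment, following the steps above.
    R: row accumulator (n,), C: column accumulator (m,), G: gradient (n, m)."""
    sq = G ** 2 + eps1                                 # G_t^2 + eps1 * 1_n 1_m^T
    R = beta2 * R + (1.0 - beta2) * sq.sum(axis=1)     # row sums:    (...) 1_m
    C = beta2 * C + (1.0 - beta2) * sq.sum(axis=0)     # column sums: 1_n^T (...)
    V = np.outer(R, C) / R.sum()                       # V_t = R_t C_t / (1_n^T R_t)
    return R, C, V

# With beta2 = 0 (the first step) and a rank-1 G^2, the reconstruction is exact:
G = np.outer(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
R, C, V = factored_second_moment_step(np.zeros(2), np.zeros(2), G, beta2=0.0, eps1=0.0)
print(np.allclose(V, G ** 2))   # True
```

For general gradients the reconstruction is an approximation: it is the rank-1 surrogate whose row and column sums match those of $ G_t^2 $.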

Why Clipping

Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates.

  • Clipping prevents the update step from becoming very large, which would destabilize training
  • Clipping mitigates the effect of unusually large gradients, preventing numerical overflow and divergence

Therefore, clipping helps ensure stable and efficient training without the per-parameter learning-rate scaling that Adam relies on.
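The clipping rule is a single rescaling: when RMS$(U_t)$ exceeds the threshold $ d $, the whole update is scaled down so its RMS equals $ d $; otherwise it is left untouched. A minimal sketch:

```python
import numpy as np

def clip_update(u, d=1.0):
    """Update clipping as in the algorithm: U_t / max(1, RMS(U_t) / d)."""
    rms_u = np.sqrt(np.mean(u ** 2))
    return u / max(1.0, rms_u / d)

u = np.array([10.0, -10.0, 10.0])   # abnormally large normalized update, RMS = 10
u_hat = clip_update(u, d=1.0)       # rescaled to [1, -1, 1], so RMS(u_hat) = 1
```

Note that the direction of the update is preserved; only its overall scale is capped.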

Why Adafactor is more memory-efficient than Adam

  • $ R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m $
  • $ C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T) $

Instead of storing the full $ G_t^2 $ matrix, Adafactor stores only its row sums $ R_t $ and column sums $ C_t $, which reduces the memory requirement for the second-moment state from $ O(n \times m) $ to $ O(n + m) $.

  • $ \hat{V}_t = \frac{R_t C_t}{1_n^T R_t} $
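The savings can be made concrete with a back-of-the-envelope count. The layer dimensions below are hypothetical, chosen to resemble a large embedding matrix:

```python
# Second-moment state for one n x m weight matrix (illustrative sizes)
n, m = 50_000, 4_096
adam_floats = n * m         # Adam keeps the full V_t matrix: O(n*m) values
adafactor_floats = n + m    # Adafactor keeps R_t and C_t:    O(n+m) values
print(adam_floats / adafactor_floats)   # roughly 3,800x fewer stored values
```

For such a layer, Adam's second-moment buffer holds about 205 million floats, while Adafactor's factored state holds about 54 thousand.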

4. Proposed Hyperparameters for Adafactor

  • Regularization constant 1: $ \epsilon_1 = 10^{-30} $
  • Ensures numerical stability by preventing division by zero when normalizing by the second-moment estimate; its value should therefore be very close to zero
  • Regularization constant 2: $ \epsilon_2 = 10^{-3} $
  • Helps stabilize parameter updates when the parameters themselves have very small magnitude, by putting a floor under the step size $ \alpha_t $. Compared to $ \epsilon_1 $, a relatively larger value is needed to keep updates stable in noisy, low-magnitude scenarios.
  • Clipping threshold: $ d = 1 $
  • A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.
  • Relative step size: $ \rho_t = \min(10^{-2}, 1/\sqrt{t}) $
    • $ \min(10^{-2}, \cdot) $ caps the relative step size at $ 10^{-2} $, an empirically determined upper bound
    • $ \frac{1}{\sqrt{t}} $ promotes convergence: it allows sufficient exploration in early iterations while ensuring stability in later iterations
  • Second moment decay: $ \hat{\beta}_{2t} = 1 - t^{-0.8} $
    • The $ 1 - \cdot $ form keeps the decay factor close to 1 as $ t $ grows
    • $ t^{-0.8} $: the exponent 0.8 balances rapid adaptation early in training against stability in later iterations
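The two schedules above are simple functions of the step count $ t $; evaluating them for a few values makes their behavior concrete:

```python
def rho(t):
    """Proposed relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / t ** 0.5)

def beta2(t):
    """Proposed second-moment decay: 1 - t^(-0.8)."""
    return 1.0 - t ** (-0.8)

schedule = [(t, rho(t), beta2(t)) for t in (1, 10, 100, 10_000)]
# At t = 1: rho = 1e-2 (the cap binds, since 1/sqrt(1) = 1) and beta2 = 0,
# so the first step uses no second-moment history at all.
# The cap on rho binds until t = 10^4; beyond that rho decays like 1/sqrt(t),
# while beta2 climbs toward 1 (e.g. beta2(10) ~ 0.84, beta2(10^4) ~ 0.999).
```

Starting with $ \hat{\beta}_{21} = 0 $ makes the first second-moment estimate equal to the first squared gradient, which removes the initialization-bias correction that Adam needs.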

Numerical Examples

Applications

Conclusion

Reference