Adafactor

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
== Introduction ==
Adafactor is an adaptive-learning-rate optimization algorithm designed for training very large models. Unlike Adam, it does not keep a full second-moment estimate for every parameter; it instead keeps factored row and column statistics, which greatly reduces memory use.

== Problem formulation ==

=== 1. Objective ===
Minimize the loss function <math>f(X)</math>, where <math>f : \mathbb{R}^{n} \to \mathbb{R}</math> and <math>X</math> is the weight vector to be optimized.

=== 2. Parameters ===
* Gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
* Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1)</math>
* Where:
** <math>\hat{V}_t</math> is the running average of the squared gradient.
** <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
** <math>\epsilon_1</math> is a regularization constant.
* Step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\,\rho_t</math>
* Where:
** <math>\rho_t</math> is the relative step size.
** <math>\epsilon_2</math> is a regularization constant.
** <math>\text{RMS}</math> is the root mean square; for the unscaled update <math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math> it is defined as:
*** <math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>

=== 3. Algorithms ===

==== Adafactor for Weighted Vectors ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^{n}</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1</math>, <math>\epsilon_2</math>
* Clipping threshold: <math>d</math>

'''Algorithm:'''
* For <math>t = 1</math> to <math>T</math>:
** Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\,\rho_t</math>
** Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
** Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
** Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
** Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)}</math>
** Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
* End for

==== Adafactor for Weighted Matrices ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1</math>, <math>\epsilon_2</math>
* Clipping threshold: <math>d</math>

'''Algorithm:'''
* For <math>t = 1</math> to <math>T</math>:
** Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\,\rho_t</math>
** Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
** Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
** Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
** Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
** Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
** Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)}</math>
** Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
* End for
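The matrix variant above can be made concrete with a short NumPy sketch. This is an illustrative implementation only: the helper names <code>rms</code> and <code>adafactor_step</code> are chosen here for readability, and <math>\hat{V}_t</math> is materialized explicitly, which a memory-efficient implementation would avoid.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    """Root mean square over all entries of x."""
    return np.sqrt(np.mean(np.square(x)))

def adafactor_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor iteration for an n x m weight matrix X (illustrative sketch).

    G is the gradient at X, R (length n) and C (length m) are the factored
    second-moment accumulators, and t is the 1-based iteration counter.
    """
    # Proposed schedules: relative step size, decay, and adaptive step size.
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    alpha_t = max(eps2, rms(X)) * rho_t

    # Row-wise and column-wise second-moment updates.
    sq = np.square(G) + eps1
    R = beta2_t * R + (1.0 - beta2_t) * sq.sum(axis=1)
    C = beta2_t * C + (1.0 - beta2_t) * sq.sum(axis=0)

    # Factored second-moment estimate V_t = R C / (1^T R).  It is formed
    # explicitly here for clarity; a memory-efficient implementation applies
    # it element-wise without building the full n x m matrix.
    V = np.outer(R, C) / R.sum()

    # Normalized gradient, update clipping, and parameter update.
    U = G / np.sqrt(V)
    U_hat = U / max(1.0, rms(U) / d)
    return X - alpha_t * U_hat, R, C

# Example: one step on the 3 x 3 matrices from the numerical example below.
X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
X1, R1, C1 = adafactor_step(X0, G1, R=np.zeros(3), C=np.zeros(3), t=1)
</syntaxhighlight>

Because <math>\hat{\beta}_{21} = 0</math>, the accumulators can simply start at zero; their initial values never enter the first update.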


=== 4. Proposed Hyperparameters for Adafactor ===
* '''Regularization constant 1''': <math>\epsilon_1 = 10^{-30}</math>
** Ensures numerical stability by preventing division by zero in the second-moment estimates, so its value should be very close to zero.
* '''Regularization constant 2''': <math>\epsilon_2 = 10^{-3}</math>
** Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared with <math>\epsilon_1</math>, the relatively larger value keeps updates stable under noise in low-magnitude scenarios.
* '''Clipping threshold''': <math>d = 1</math>
** A threshold of 1 balances stability and learning efficiency: it avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.
* '''Relative step size''': <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
** <math>\min(10^{-2}, \cdot)</math> caps the learning rate at <math>10^{-2}</math>, an empirically chosen upper bound.
** <math>1/\sqrt{t}</math> promotes convergence, balancing sufficient exploration in early iterations with stability in later iterations.
* '''Second moment decay''': <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
** The form <math>1 - t^{-0.8}</math> keeps the decay factor close to 1.
** The exponent 0.8 balances rapid adaptation in early training with stability in later iterations; both schedules are illustrated in the short sketch below.
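As a quick sanity check on these schedules (a minimal Python sketch; the iteration counts are chosen arbitrarily), the relative step size stays at its cap of <math>10^{-2}</math> until <math>t > 10^{4}</math>, while the decay <math>\hat{\beta}_{2t}</math> rises toward 1:

<syntaxhighlight lang="python">
import numpy as np

# Behaviour of the proposed schedules over a few iteration counts.
for t in [1, 10, 100, 10_000, 1_000_000]:
    rho_t = min(1e-2, 1.0 / np.sqrt(t))   # relative step size (capped at 1e-2)
    beta2_t = 1.0 - t ** (-0.8)           # second moment decay (approaches 1)
    print(f"t={t:>9,}  rho_t={rho_t:.1e}  beta2_t={beta2_t:.5f}")
</syntaxhighlight>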
=== 5. Discussion ===

==== Why Clipping ====
Adafactor employs clipping to maintain numerical stability, especially since it is designed for very large models and often works with unscaled learning rates.
* Clipping prevents the update step from becoming very large, which would destabilize training.
* Clipping mitigates the effect of very large gradients, preventing numerical instability.
Therefore, implementing clipping helps ensure stable and efficient training without requiring the per-parameter scaling that Adam uses.
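A minimal sketch of the clipping rule <math>\hat{U}_t = U_t / \max(1, \text{RMS}(U_t)/d)</math> from the algorithm above (NumPy; the helper name <code>clip_update</code> is chosen here for illustration):

<syntaxhighlight lang="python">
import numpy as np

def clip_update(U, d=1.0):
    """Rescale the update only when its RMS exceeds the threshold d."""
    return U / max(1.0, np.sqrt(np.mean(np.square(U))) / d)

print(clip_update(np.array([0.1, -0.2, 0.3])))     # RMS < 1: returned unchanged
print(clip_update(np.array([10.0, -20.0, 30.0])))  # RMS ~ 21.6: rescaled so RMS = 1
</syntaxhighlight>

Moderate updates pass through unchanged, so clipping only intervenes on the extreme steps described above.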


==== Why Adafactor is more memory efficient, compared to Adam ====
'''Row-wise and Column-wise Second Moment Updates'''
* <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
* <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
Instead of storing the full <math>G_t^2</math>, Adafactor maintains only its row and column statistics, which reduces the memory requirement from <math>O(n \times m)</math> to <math>O(n + m)</math>.

'''Factored Representation of the Second Moment'''
* <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
This reconstructs the second-moment estimate from the outer product <math>R_t C_t</math>.
* However, the cost is not <math>O(n \times m)</math>, because:
** the operation is performed element-wise, so <math>\hat{V}_t</math> is never materialized as an <math>n \times m</math> matrix;
** only <math>R_t</math> and <math>C_t</math> are stored, rather than the full second-moment matrix (a concrete size comparison is sketched below).
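To make the savings concrete, here is a small back-of-the-envelope comparison in Python; the layer size is an arbitrary example, and only the second-moment statistics are counted:

<syntaxhighlight lang="python">
# Second-moment storage for one n x m weight matrix:
# Adam keeps a full n x m accumulator, while Adafactor keeps only
# the n-vector R_t and the m-vector C_t.
n, m = 4096, 4096
full_entries = n * m          # O(n*m): 16,777,216 values
factored_entries = n + m      # O(n+m): 8,192 values
print(full_entries, factored_entries)
</syntaxhighlight>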


== Numerical Examples ==
Step-by-step instructions for determining the result of the first iteration.

'''<big>Problem setup</big>'''

'''Initial weights (<math>X_0</math>):'''

<math>X_0 = \begin{bmatrix} 0.7 & -0.5 & 0.9 \\ -1.1 & 0.8 & -1.6 \\ 1.2 & -0.7 & 0.4 \end{bmatrix}</math>

'''Gradient (<math>G_t</math>):'''

<math>G_t = \begin{bmatrix} 0.3 & -0.2 & 0.4 \\ -0.5 & 0.6 & -0.1 \\ 0.2 & -0.4 & 0.3 \end{bmatrix}</math>

'''<big>Hyperparameters setup</big>'''

<math>\epsilon_1 = 10^{-30}</math> (Regularization constant 1)

<math>\epsilon_2 = 10^{-3}</math> (Regularization constant 2)

<math>d = 1</math> (Clipping threshold)

<math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> (Relative step size)

<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (Second moment decay)

'''<big>Step 1: Learning Rate Scaling</big>'''

Define the relative step size at <math>t = 1</math>:

<math>\rho_1 = \min(10^{-2}, 1/\sqrt{1}) = 10^{-2}</math>

'''Step 1.1: Root Mean Square (RMS) calculation for <math>X_0</math>'''

RMS formula:

<math>\text{RMS}(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} X_0[i]^2}</math>

Substitute the initial weights:

<math>\text{RMS}(X_0) = \sqrt{\tfrac{1}{9}\left(0.7^2 + (-0.5)^2 + 0.9^2 + (-1.1)^2 + 0.8^2 + (-1.6)^2 + 1.2^2 + (-0.7)^2 + 0.4^2\right)} = \sqrt{\tfrac{8.05}{9}} \approx 0.946</math>

Find the learning rate scaling (<math>\alpha_t</math>):

<math>\alpha_1 = \max(\epsilon_2, \text{RMS}(X_0))\,\rho_1 = \max(10^{-3}, 0.946) \times 10^{-2} \approx 9.46 \times 10^{-3}</math>
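The hand calculation above can be checked with a short NumPy snippet (an illustrative sketch; variable names are chosen here for clarity):

<syntaxhighlight lang="python">
import numpy as np

# Check of the Step 1 quantities computed by hand above.
X0 = np.array([[ 0.7, -0.5,  0.9],
               [-1.1,  0.8, -1.6],
               [ 1.2, -0.7,  0.4]])
eps2, t = 1e-3, 1

rho_1 = min(1e-2, 1.0 / np.sqrt(t))        # relative step size
rms_X0 = np.sqrt(np.mean(np.square(X0)))   # RMS(X_0)
alpha_1 = max(eps2, rms_X0) * rho_1        # learning rate scaling

print(rho_1, round(rms_X0, 3), round(alpha_1, 5))   # 0.01 0.946 0.00946
</syntaxhighlight>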
== Applications ==
== Conclusion ==
== Reference ==
