Adafactor: Difference between revisions

Revision as of 23:26, 10 December 2024

Authors: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function $f(x)$, where $x\in \mathbb {R} ^{n}$ is the weight vector to be optimized.

2. Parameters

Gradient:

$G_{t}=\nabla f(x_{t-1})$

Second moment estimate:

${\hat {V}}_{t}={\hat {\beta }}_{2t}{\hat {V}}_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n})$

Where:
- ${\hat {V}}_{t}$ is the running average of the squared gradient.
- ${\hat {\beta }}_{2t}$ is the corrected decay parameter.
- $\epsilon _{1}$ is a regularization constant.

Step size:

$\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(x_{t-1}))\rho _{t}$

Where:
- $\rho _{t}$ is the relative step size.
- $\epsilon _{2}$ is a regularization constant.
- ${\text{RMS}}$ is the root mean square. For the normalized gradient $U_{t}$, whose entries $u_{xt}$ are indexed by the individual parameters $x\in X$, it is defined as:
  - $u_{xt}={\frac {g_{xt}}{\sqrt {{\hat {v}}_{xt}}}}$
  - ${\text{RMS}}(U_{t})={\text{RMS}}_{x\in X}(u_{xt})={\sqrt {{\text{Mean}}_{x\in X}\left({\frac {(g_{xt})^{2}}{{\hat {v}}_{xt}}}\right)}}$

3. Algorithms

Adafactor for Weight Vectors

Inputs:

Initial point: $X_{0}\in \mathbb {R} ^{n}$
Relative step sizes: $\rho _{t}$ for $t=1$ to $T$
Second moment decay: ${\hat {\beta }}_{2t}$ for $t=1$ to $T$ , with ${\hat {\beta }}_{21}=0$
Regularization constants: $\epsilon _{1},\epsilon _{2}$
Clipping threshold: $d$

Algorithm:

For $t=1$ to $T$:
- Compute adaptive step size: $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$
- Compute gradient: $G_{t}=\nabla f_{t}(X_{t-1})$
- Update second moment estimate: ${\hat {V}}_{t}={\hat {\beta }}_{2t}{\hat {V}}_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n})$
- Compute normalized gradient: $U_{t}={\frac {G_{t}}{\sqrt {{\hat {V}}_{t}}}}$
- Apply clipping: ${\hat {U}}_{t}={\frac {U_{t}}{\max(1,{\text{RMS}}(U_{t})/d)}}$
- Update parameter: $X_{t}=X_{t-1}-\alpha _{t}{\hat {U}}_{t}$
End for
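The loop body above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names `rms` and `adafactor_vector_step` are hypothetical, the schedules for $\rho _{t}$ and ${\hat {\beta }}_{2t}$ are assumed to be the ones proposed in the hyperparameter section, and the second moment is assumed to start at zero.

```python
import math

def rms(values):
    """Root mean square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def adafactor_vector_step(x, grad, v_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor step for a weight vector (illustrative sketch).

    x, grad, v_prev are lists of equal length; t is the step index (t >= 1).
    """
    rho = min(1e-2, 1.0 / math.sqrt(t))      # assumed relative step size schedule
    beta2 = 1.0 - t ** -0.8                  # assumed decay schedule, equals 0 at t = 1
    alpha = max(eps2, rms(x)) * rho          # adaptive step size
    # Running average of the squared gradient, regularized by eps1.
    v = [beta2 * vp + (1.0 - beta2) * (g * g + eps1) for vp, g in zip(v_prev, grad)]
    # Normalized gradient.
    u = [g / math.sqrt(vi) for g, vi in zip(grad, v)]
    # Clip so that the RMS of the update does not exceed d.
    clip = max(1.0, rms(u) / d)
    x_new = [xi - alpha * ui / clip for xi, ui in zip(x, u)]
    return x_new, v

# Hypothetical toy input: first step (t = 1) with a zero-initialized second moment.
x1, v1 = adafactor_vector_step([1.0, -2.0], [0.5, -0.25], [0.0, 0.0], t=1)
print(x1, v1)
```

At $t=1$ the decay is zero, so ${\hat {V}}_{1}=G_{1}^{2}+\epsilon _{1}1_{n}$ and every nonzero entry of $U_{1}$ has magnitude close to 1; the first update therefore moves each weight by roughly $\alpha _{1}$ against the sign of its gradient.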

Adafactor for Weight Matrices

Inputs:

Initial point: $X_{0}\in \mathbb {R} ^{n\times m}$
Relative step sizes: $\rho _{t}$ for $t=1$ to $T$
Second moment decay: ${\hat {\beta }}_{2t}$ for $t=1$ to $T$ , with ${\hat {\beta }}_{21}=0$
Regularization constants: $\epsilon _{1},\epsilon _{2}$
Clipping threshold: $d$

Algorithm:

For $t=1$ to $T$:
- Compute adaptive step size: $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$
- Compute gradient: $G_{t}=\nabla f_{t}(X_{t-1})$
- Update row-wise second moment: $R_{t}={\hat {\beta }}_{2t}R_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n}1_{m}^{T})1_{m}$
- Update column-wise second moment: $C_{t}={\hat {\beta }}_{2t}C_{t-1}+(1-{\hat {\beta }}_{2t})1_{n}^{T}(G_{t}^{2}+\epsilon _{1}1_{n}1_{m}^{T})$
- Update overall second moment estimate: ${\hat {V}}_{t}={\frac {R_{t}C_{t}}{1_{n}^{T}R_{t}}}$
- Compute normalized gradient: $U_{t}={\frac {G_{t}}{\sqrt {{\hat {V}}_{t}}}}$
- Apply clipping: ${\hat {U}}_{t}={\frac {U_{t}}{\max(1,{\text{RMS}}(U_{t})/d)}}$
- Update parameter: $X_{t}=X_{t-1}-\alpha _{t}{\hat {U}}_{t}$
End for
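A plain-Python sketch of one iteration of the factored update is given below. It is an illustration under stated assumptions, not a definitive implementation: the names `rms` and `adafactor_matrix_step` are hypothetical, the schedules for $\rho _{t}$ and ${\hat {\beta }}_{2t}$ are assumed to be those of the hyperparameter section, and $R_{0}$ and $C_{0}$ are assumed to be zero. The input matrices are the ones used in the numerical example.

```python
import math

def rms(matrix):
    """Root mean square over all entries of a matrix given as a list of rows."""
    flat = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))

def adafactor_matrix_step(x, grad, r_prev, c_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor step for an n x m weight matrix (illustrative sketch)."""
    n, m = len(grad), len(grad[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))      # assumed relative step size schedule
    beta2 = 1.0 - t ** -0.8                  # assumed decay schedule, equals 0 at t = 1
    alpha = max(eps2, rms(x)) * rho          # adaptive step size
    sq = [[g * g + eps1 for g in row] for row in grad]
    # Row-wise statistic R_t (length n): row sums of the regularized squared gradient.
    r = [beta2 * r_prev[i] + (1.0 - beta2) * sum(sq[i]) for i in range(n)]
    # Column-wise statistic C_t (length m): column sums.
    c = [beta2 * c_prev[j] + (1.0 - beta2) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    total = sum(r)                           # 1_n^T R_t
    # V_t[i][j] = r[i] * c[j] / total is evaluated entry by entry and never stored.
    u = [[grad[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    clip = max(1.0, rms(u) / d)              # update clipping
    x_new = [[x[i][j] - alpha * u[i][j] / clip for j in range(m)] for i in range(n)]
    return x_new, r, c

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
X1, R1, C1 = adafactor_matrix_step(X0, G1, [0.0] * 3, [0.0] * 3, t=1)
print(R1, C1)
```

Only the vectors `r` (length $n$) and `c` (length $m$) are carried between steps, which is where the memory saving over a full $n\times m$ second-moment matrix comes from.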

4. Proposed Hyperparameters for Adafactor

Regularization constant 1: $\epsilon _{1}=10^{-30}$
Regularization constant 2: $\epsilon _{2}=10^{-3}$
Clipping threshold: $d=1$
Relative step size: $\rho _{t}=\min(10^{-2},1/{\sqrt {t}})$
Second moment decay: ${\hat {\beta }}_{2t}=1-t^{-0.8}$
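To see how the two schedules behave over training, the short sketch below evaluates them at a few step counts. The function names are hypothetical helpers introduced only for this illustration.

```python
import math

def relative_step_size(t: int) -> float:
    """rho_t = min(1e-2, 1/sqrt(t)) for step t >= 1."""
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t: int) -> float:
    """beta2_hat_t = 1 - t^(-0.8) for step t >= 1."""
    return 1.0 - t ** -0.8

# Print both schedules at a few representative steps.
for t in (1, 10, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t), second_moment_decay(t))
```

The relative step size stays at the cap $10^{-2}$ until $t=10^{4}$, after which $1/{\sqrt {t}}$ is the smaller term. The decay starts at exactly 0 for $t=1$, so the first second-moment estimate depends only on the first gradient, and then approaches 1 as $t$ grows.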

Numerical Examples

Step-by-step instructions for determining the result of the first iteration.

Problem setup

Initial weights ( $X_{0}$ ):

$X_{0}={\begin{bmatrix}0.7&-0.5&0.9\\-1.1&0.8&-1.6\\1.2&-0.7&0.4\end{bmatrix}}$

Gradient ( $G_{t}$ ):

$G_{t}={\begin{bmatrix}0.3&-0.2&0.4\\-0.5&0.6&-0.1\\0.2&-0.4&0.3\end{bmatrix}}$

Hyperparameters setup

$\epsilon _{1}=10^{-30}$ (Regularization constant)

$\epsilon _{2}=10^{-3}$ (Minimum learning rate scaling factor)

$d=1$ (Clipping threshold)

$\rho _{t}=\min(10^{-2},1/{\sqrt {t}})$ (Relative step size)

${\hat {\beta }}_{2t}=1-t^{-0.8}$ (Second moment decay)

Step 1: Learning Rate Scaling

Define the relative step size

$\rho _{t}=\min(10^{-2},1/{\sqrt {1}})=10^{-2}$

Step 1.1: Root Mean Square (RMS) calculation for $X_{0}$

RMS formula, where $N$ is the total number of entries of $X_{0}$ (here $N=9$):

$RMS(X_{0})={\sqrt {{\tfrac {1}{N}}\textstyle \sum _{i=1}^{N}\displaystyle X_{0}[i]^{2}}}$

Substitute the initial weights

$RMS(X_{0})={\sqrt {{\tfrac {1}{9}}(0.7^{2}+(-0.5)^{2}+0.9^{2}+(-1.1)^{2}+0.8^{2}+(-1.6)^{2}+1.2^{2}+(-0.7)^{2}+0.4^{2})}}$

$RMS(X_{0})={\sqrt {\frac {8.05}{9}}}\approx 0.946$

Find the learning rate scaling ($\alpha _{t}$):
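The scaling follows from the step size formula $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$ at $t=1$. The sketch below evaluates it numerically, taking $X_{0}$ exactly as listed in the problem setup; the variable names are chosen only for this illustration.

```python
import math

# Initial weights as listed in the problem setup.
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
eps2 = 1e-3                                  # lower bound on the scaling
rho_1 = min(1e-2, 1.0 / math.sqrt(1))        # relative step size at t = 1

# RMS over all nine entries of X0.
flat = [v for row in X0 for v in row]
rms_x0 = math.sqrt(sum(v * v for v in flat) / len(flat))

# alpha_1 = max(eps2, RMS(X0)) * rho_1
alpha_1 = max(eps2, rms_x0) * rho_1
print(round(rms_x0, 4), round(alpha_1, 5))
```

Because ${\text{RMS}}(X_{0})$ is far larger than $\epsilon _{2}$, the lower bound is inactive here and $\alpha _{1}$ is simply ${\text{RMS}}(X_{0})$ multiplied by $10^{-2}$.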

@@ Line 29: / Line 29: @@
 *** <math>u_{xt} = \frac{g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>
 *** <math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
 === 3. Algorithms ===
 ==== Adafactor for Weight Vectors ====
@@ Line 69: / Line 70: @@
 === 4. Proposed Hyperparameters for Adafactor ===
-* '''Regularization constant 1''': <math>\epsilon_1 = 10^{-30}</math>
+* Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
-* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates, so the numerical value should be very close to zero
+* Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
-* '''Regularization constant 2''': <math>\epsilon_2 = 10^{-3}</math>
+* Clipping threshold: <math>d = 1</math>
-* Help to stabilize parameter updates by controlling the effect of second-moment scaling in low-magnitude scenarios. Compared to <math>\epsilon_2</math>, a relatively larger value ensures the stability of noise and low-magnitude scenarios.
+* Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
-* '''Clipping threshold''': <math>d = 1</math>
+* Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
-* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.
-* '''Relative step size''': <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
+== Numerical Examples ==
-** <math>min(10^-2, ...)</math> can caps the learning rate at 10^-2, which is a empirical found for upper bound
+Step-by-step instructions for determining the result of the first iteration.
-** <math>\frac{1}{\sqrt{t}}</math> This step size promote convergence of the model. This rate ensures a balance between sufficient exploration in early iteration and stability in later iterations
-* '''Second moment decay''': <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
+'''<big>Problem setup</big>'''
-** 1-...: ensures the decay factor remains close to 1
-** <math>t^{-0,8}</math> the power 0.8 ensures a balance between rapid adaptation in early training and later iterations
+'''Initial weights ('''<math>X_0</math>'''):'''
+<math>X_0 = \begin{bmatrix} 0.7 &-0.5& 0.9\\ -1.1 & 0.8& -1.6\\1.2&-0.7& 0.4 \end{bmatrix}</math>
+'''Gradient (<math>G_t</math>):'''
+<math>G_t = \begin{bmatrix} 0.3&-0.2&0.4\\ -0.5&0.6&-0.1\\0.2&-0.4 &0.3 \end{bmatrix}</math>
+'''<big>Hyperparameters setup</big>'''
+<math>\epsilon_1 = 10^{-30}</math> (Regularization constant)
+<math>\epsilon_2 = 10^{-3}</math> (Minimum learning rate scaling factor)
+<math>d = 1</math> (Clipping threshold)
+<math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> (Relative step size)
+<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (Second moment decay)
+'''<big>Step 1:  Learning Rate Scaling</big>'''
+Define the relative step size
+<math>\rho_t = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}</math>
+'''Step 1.1: Root Mean Square(RMS) calculation for <math>X_0</math>'''
+Root Mean Square(RMS) calculation for <math>X_0</math>
+RMS formula
-=== 5.Discussion ===
+<math>RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle  X_0[i]^2}</math>
-==== Why Clipping ====
+Substitute the initial weights
-Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates.
-* Clipping prevents the update step from becoming very large, which would destabilize training
-* Clipping mitigates the effects of very large gradients preventing numerical instability
-Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.
-==== Why Adafactor is more memory efficient, compared to Adam ====
+<math>RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}</math>
-'''Row-wise and Column-wise Second Moment Updates'''
-*<math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
-*<math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
-Instead of storing the full <math>G_t^2</math>, Adafactor computes the row and column respectively, which reduces the memory requirements from <math>O(n\times m)</math> to <math>O(n + m)</math>
-'''Factored Representation of the Second Moment'''
+<math>RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946</math>
-* <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
-This updates the second momentum based on the outer product <math>R_t C_t</math>.
+Find the Learning Rate Scaling (αt):
-*However, this is not <math>O(n\times m)</math> since
-** The operation is performed element-wise, so it actually never materializes <math>\hat{V_t}</math> as a <math>n\times n</math> matrix
-** It also only storing <math>R_t</math>and <math> C_t</math> instead of storage the full second-moment matrix
-== Numerical Examples ==
 == Applications ==
 == Conclusion ==
 == Reference ==
