*** <math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>
*** <math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
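
These element-wise definitions can be sanity-checked in a few lines of NumPy. The sketch below is purely illustrative: the arrays <code>g</code> and <code>v_hat</code> hold made-up placeholder values, and <code>rms</code> is a helper defined only for this example.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    """Root mean square over all entries of x."""
    return np.sqrt(np.mean(np.square(x)))

# Placeholder gradient and second-moment estimate for one parameter block
g = np.array([[0.3, -0.2, 0.4],
              [-0.5, 0.6, -0.1]])
v_hat = 0.5 * np.square(g) + 1e-30   # stand-in values for \hat{v}_{xt}

u = -g / np.sqrt(v_hat)              # u_{xt} = -g_{xt} / sqrt(v_hat_{xt})
rms_u = rms(g / np.sqrt(v_hat))      # RMS(U_t) = sqrt(mean(g^2 / v_hat))

print(u)
print(rms_u)
</syntaxhighlight>
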
=== 3. Algorithms ===

==== Adafactor for Weighted Vectors ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^n</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1, \epsilon_2</math>
* Clipping threshold: <math>d</math>

'''Algorithm:'''
* For <math>t = 1</math> to <math>T</math>:
** Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\, \rho_t</math>
** Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
** Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
** Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
** Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)}</math>
** Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
* End for

==== Adafactor for Weighted Matrices ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1, \epsilon_2</math>
* Clipping threshold: <math>d</math>

'''Algorithm:'''
* For <math>t = 1</math> to <math>T</math>:
** Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\, \rho_t</math>
** Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
** Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
** Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
** Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
** Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
** Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)}</math>
** Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
* End for
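
For readers who prefer code to pseudocode, the following NumPy sketch mirrors the matrix variant step by step, using the hyperparameter schedules proposed in the next section. It is an illustrative sketch, not the reference implementation from the Adafactor paper; the function name <code>adafactor_step</code> and the explicit gradient argument are choices made for this example.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    """Root mean square over all entries of x."""
    return np.sqrt(np.mean(np.square(x)))

def adafactor_step(X, grad, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One illustrative Adafactor update for an n x m weight matrix X."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
    beta2_t = 1.0 - t ** (-0.8)                          # second moment decay (0 at t = 1)
    alpha_t = max(eps2, rms(X)) * rho_t                  # adaptive step size

    G2 = np.square(grad) + eps1
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)   # row-wise second moment
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)   # column-wise second moment
    V_hat = np.outer(R, C) / R.sum()                     # factored second-moment estimate

    U = grad / np.sqrt(V_hat)                            # normalized gradient
    U_hat = U / max(1.0, rms(U) / d)                     # update clipping
    return X - alpha_t * U_hat, R, C

# One step with the matrices used in the numerical example below
X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
X1, R1, C1 = adafactor_step(X0, G1, R=np.zeros(3), C=np.zeros(3), t=1)
</syntaxhighlight>
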
=== 4. Proposed Hyperparameters for Adafactor ===
* '''Regularization constant 1''': <math>\epsilon_1 = 10^{-30}</math>
** Ensures numerical stability by preventing division by zero in the calculation of the second-moment estimates, so its value is chosen to be extremely close to zero.
* '''Regularization constant 2''': <math>\epsilon_2 = 10^{-3}</math>
** Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to <math>\epsilon_1</math>, a relatively larger value keeps the update stable under noise and in low-magnitude scenarios.
* '''Clipping threshold''': <math>d = 1</math>
** A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.
* '''Relative step size''': <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
** The <math>\min(10^{-2}, \cdot)</math> term caps the relative step size at <math>10^{-2}</math>, an empirically chosen upper bound.
** The <math>1/\sqrt{t}</math> term promotes convergence: it balances sufficient exploration in early iterations with stability in later iterations.
* '''Second moment decay''': <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
** The <math>1 - \cdots</math> form ensures the decay factor remains close to 1 as training progresses.
** The exponent 0.8 balances rapid adaptation in early training with stability in later iterations (both schedules are illustrated in the short sketch after this list).
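
The two proposed schedules are simple functions of the step count <math>t</math>. A minimal sketch (function names are illustrative, not from the original paper):

<syntaxhighlight lang="python">
import math

def relative_step_size(t: int) -> float:
    """rho_t = min(1e-2, 1/sqrt(t))"""
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t: int) -> float:
    """beta2_hat_t = 1 - t^(-0.8)"""
    return 1.0 - t ** (-0.8)

for t in (1, 10, 100, 1000):
    print(t, relative_step_size(t), second_moment_decay(t))

# The 1e-2 cap binds while 1/sqrt(t) > 1e-2, i.e. for t < 10**4;
# afterwards rho_t decays as 1/sqrt(t).
</syntaxhighlight>
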
=== 5. Discussion ===

==== Why Clipping ====
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often runs with relative, unscaled learning rates.

* Clipping prevents the update step from becoming excessively large, which would destabilize training.
* Clipping mitigates the effect of very large gradients, preventing numerical instability.

Therefore, clipping helps ensure stable and efficient training without requiring per-parameter scaling as in Adam.
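
The clipping step in the algorithm above divides the normalized update by <math>\max(1, \text{RMS}(U_t)/d)</math>, so an update whose RMS is already below the threshold passes through unchanged. A minimal NumPy sketch with illustrative helper names:

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def clip_update(U, d=1.0):
    """Rescale U so that RMS(U) <= d; leave it unchanged if already below d."""
    return U / max(1.0, rms(U) / d)

U = np.array([[3.0, -4.0], [0.5, 12.0]])   # an update with large entries
U_hat = clip_update(U, d=1.0)
print(rms(U), rms(U_hat))                  # RMS drops to the threshold d = 1
</syntaxhighlight>
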
==== Why Adafactor is more memory efficient, compared to Adam ====
'''Row-wise and Column-wise Second Moment Updates'''

* <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
* <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>

Instead of storing the full <math>n \times m</math> matrix <math>G_t^2</math> as a second-moment estimate, Adafactor keeps only its row sums <math>R_t</math> and column sums <math>C_t</math>, which reduces the memory requirement from <math>O(n\times m)</math> to <math>O(n + m)</math>.
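
As a concrete illustration (array names and the fixed decay value are assumptions for this sketch, not notation from the paper), the two moving averages are maintained with two small vectors:

<syntaxhighlight lang="python">
import numpy as np

n, m = 4, 6
G = np.random.default_rng(0).normal(size=(n, m))   # gradient of an n x m weight matrix
eps1 = 1e-30
beta2 = 0.9                 # fixed decay used only for this illustration

R = np.zeros(n)             # row-wise second moment, n values
C = np.zeros(m)             # column-wise second moment, m values

G2 = np.square(G) + eps1
R = beta2 * R + (1 - beta2) * G2.sum(axis=1)   # (G^2 + eps1 * 1 1^T) 1_m
C = beta2 * C + (1 - beta2) * G2.sum(axis=0)   # 1_n^T (G^2 + eps1 * 1 1^T)

print(R.size + C.size, "values stored instead of", n * m)
</syntaxhighlight>
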
'''Factored Representation of the Second Moment'''

* <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>

This reconstructs the second-moment estimate from the outer product <math>R_t C_t</math>, normalized by <math>1_n^T R_t</math>.
* However, this does not cost <math>O(n\times m)</math> memory, since:
** The operation is performed element-wise during the update, so <math>\hat{V}_t</math> is never materialized as a persistent <math>n\times m</math> matrix.
** Only <math>R_t</math> and <math>C_t</math> are stored, rather than the full second-moment matrix.
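
To make the memory argument concrete, here is a sketch (continuing the illustrative names above) of how the update is formed from the two stored vectors. An efficient implementation fuses this into the element-wise update; the <math>n \times m</math> quantities below are transient, like the gradient itself, while only <math>R_t</math> and <math>C_t</math> persist between steps.

<syntaxhighlight lang="python">
import numpy as np

def normalized_update(G, R, C):
    """U_t = G_t / sqrt(V_hat_t), with V_hat_t = R_t C_t / (1_n^T R_t).

    Only R (length n) and C (length m) persist between steps; the
    n x m arrays created here are temporaries, like the gradient itself.
    """
    V_hat = np.outer(R, C) / R.sum()   # formed only inside this step
    return G / np.sqrt(V_hat)

n, m = 4, 6
G = np.ones((n, m))
R = np.full(n, 6.0)   # e.g. row sums of G^2 + eps1
C = np.full(m, 4.0)   # e.g. column sums of G^2 + eps1
print(normalized_update(G, R, C))

# Persistent optimizer state per n x m weight matrix:
#   Adam:      two n x m moment matrices   -> 2 * n * m values
#   Adafactor: R (n values) + C (m values) -> n + m values
</syntaxhighlight>
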
== Numerical Examples ==
Step-by-step instructions for determining the result of the first iteration.

'''<big>Problem setup</big>'''

'''Initial weights (<math>X_0</math>):'''

<math>X_0 = \begin{bmatrix} 0.7 &-0.5& 0.9\\ -1.1 & 0.8& -1.6\\1.2&-0.7& 0.4 \end{bmatrix}</math>

'''Gradient (<math>G_t</math>):'''

<math>G_t = \begin{bmatrix} 0.3&-0.2&0.4\\ -0.5&0.6&-0.1\\0.2&-0.4 &0.3 \end{bmatrix}</math>

'''<big>Hyperparameters setup</big>'''

<math>\epsilon_1 = 10^{-30}</math> (Regularization constant 1)

<math>\epsilon_2 = 10^{-3}</math> (Regularization constant 2)

<math>d = 1</math> (Clipping threshold)

<math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> (Relative step size)

<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (Second moment decay)

'''<big>Step 1: Learning Rate Scaling</big>'''

Define the relative step size:

<math>\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}</math>

'''Step 1.1: Root Mean Square (RMS) calculation for <math>X_0</math>'''

RMS formula:

<math>\text{RMS}(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle X_0[i]^2}</math>

Substitute the initial weights:

<math>\text{RMS}(X_0) = \sqrt{\tfrac{1}{9}\left(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2\right)}</math>

<math>\text{RMS}(X_0) = \sqrt{\tfrac{8.05}{9}}\approx 0.946</math>

Find the learning rate scaling (<math>\alpha_t</math>):

<math>\alpha_1 = \max(\epsilon_2, \text{RMS}(X_0))\,\rho_1 = \max(10^{-3}, 0.946)\times 10^{-2} \approx 9.46\times 10^{-3}</math>
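
The quantities in Step 1 can be checked quickly in NumPy; the short verification sketch below should reproduce the hand calculation above.

<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -1.6],
               [1.2, -0.7, 0.4]])

eps2 = 1e-3
rho_1 = min(1e-2, 1 / np.sqrt(1))          # relative step size at t = 1
rms_X0 = np.sqrt(np.mean(np.square(X0)))   # RMS(X_0) = sqrt(8.05 / 9)
alpha_1 = max(eps2, rms_X0) * rho_1        # learning rate scaling

print(rho_1, rms_X0, alpha_1)              # 0.01, ~0.946, ~0.00946
</syntaxhighlight>
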
== Applications ==

== Conclusion ==

== Reference ==