Adafactor
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function <math>f(X)</math>, where <math>X \in \mathbb{R}^{n}</math> and <math>X</math> is the weight vector to be optimized.
2. Parameters
- Gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})\left(G_t^2 + \epsilon_1 1_n\right)</math>
- Where:
- <math>\hat{V}_t</math> is the running average of the squared gradient.
- <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
- <math>\epsilon_1</math> is a regularization constant.
- Step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\,\rho_t</math>
- Where:
- <math>\rho_t</math> is the relative step size.
- <math>\epsilon_2</math> is a regularization constant.
- <math>\text{RMS}(U_t)</math> is the root mean square of the update <math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>, defined as:
<math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
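As a concrete illustration of these definitions, the short NumPy sketch below computes the RMS of the parameters and the resulting adaptive step size <math>\alpha_t</math>. The helper names (`rms`, `step_size`) are illustrative, not part of the original algorithm.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    # Root mean square over all entries of a vector or matrix.
    return np.sqrt(np.mean(np.square(x)))

def step_size(x_prev, rho_t, eps2=1e-3):
    # Adaptive step size: alpha_t = max(eps2, RMS(X_{t-1})) * rho_t.
    return max(eps2, rms(x_prev)) * rho_t
</syntaxhighlight>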
3. Algorithms
Adafactor for Weighted Vectors
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\,\rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})\left(G_t^2 + \epsilon_1 1_n\right)</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
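Putting the loop above into code, the following NumPy sketch performs one Adafactor update for a weight vector under the proposed hyperparameter schedules. It is a minimal illustration, not a reference implementation; the caller is assumed to supply the gradient <code>g</code> and the previous second-moment estimate <code>v</code>.

<syntaxhighlight lang="python">
import numpy as np

def adafactor_vector_step(x, v, g, t, d=1.0, eps1=1e-30, eps2=1e-3):
    # One Adafactor update for a weight vector x of shape (n,).
    # x: parameters X_{t-1}; v: second-moment estimate V_{t-1}; g: gradient G_t; t: iteration (1-based).
    rms = lambda a: np.sqrt(np.mean(np.square(a)))
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
    beta2_t = 1.0 - t ** (-0.8)                          # second moment decay
    alpha_t = max(eps2, rms(x)) * rho_t                  # adaptive step size
    v = beta2_t * v + (1.0 - beta2_t) * (g ** 2 + eps1)  # second moment update
    u = g / np.sqrt(v)                                   # normalized gradient
    u_hat = u / max(1.0, rms(u) / d)                     # update clipping
    return x - alpha_t * u_hat, v                        # new parameters, new second moment
</syntaxhighlight>

Note that at <math>t = 1</math> the decay <math>\hat{\beta}_{21} = 1 - 1^{-0.8} = 0</math>, so the second-moment estimate reduces to the regularized squared gradient and no separate bias correction is needed.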
Adafactor for Weighted Matrices
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\,\rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})\left(G_t^2 + \epsilon_1 1_n 1_m^\top\right) 1_m</math>
- Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t})\, 1_n^\top \left(G_t^2 + \epsilon_1 1_n 1_m^\top\right)</math>
- Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^\top R_t}</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
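The factored variant keeps only a length-<math>n</math> row accumulator and a length-<math>m</math> column accumulator instead of the full <math>n \times m</math> second-moment matrix. A minimal NumPy sketch of one update, mirroring the steps above (variable names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def adafactor_matrix_step(x, r, c, g, t, d=1.0, eps1=1e-30, eps2=1e-3):
    # One factored Adafactor update for a weight matrix x of shape (n, m).
    # r: row accumulator R_{t-1}, shape (n,); c: column accumulator C_{t-1}, shape (m,); g: gradient G_t.
    rms = lambda a: np.sqrt(np.mean(np.square(a)))
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    alpha_t = max(eps2, rms(x)) * rho_t
    g2 = g ** 2 + eps1                                    # regularized squared gradient
    r = beta2_t * r + (1.0 - beta2_t) * g2.sum(axis=1)    # row-wise second moment R_t
    c = beta2_t * c + (1.0 - beta2_t) * g2.sum(axis=0)    # column-wise second moment C_t
    v_hat = np.outer(r, c) / r.sum()                      # rank-1 reconstruction of V_t
    u = g / np.sqrt(v_hat)                                # normalized gradient
    u_hat = u / max(1.0, rms(u) / d)                      # update clipping
    return x - alpha_t * u_hat, r, c
</syntaxhighlight>

Storing <math>R_t</math> and <math>C_t</math> requires <math>O(n + m)</math> memory rather than the <math>O(nm)</math> needed for a full second-moment matrix, which is the main motivation for the factored update.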
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
- Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
- Clipping threshold: <math>d = 1</math>
- Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
- Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
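A small script (illustrative only) shows how these proposed schedules behave as training progresses: with the defaults above the relative step size stays at <math>10^{-2}</math> until <math>t = 10^4</math> and then decays as <math>1/\sqrt{t}</math>, while <math>\hat{\beta}_{2t}</math> rises toward 1.

<syntaxhighlight lang="python">
import numpy as np

# Proposed schedules: rho_t = min(1e-2, 1/sqrt(t)) and beta2_t = 1 - t**(-0.8)
for t in [1, 10, 100, 10_000, 1_000_000]:
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    print(f"t={t:>8}  rho_t={rho_t:.2e}  beta2_t={beta2_t:.4f}")
</syntaxhighlight>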
Numerical Examples
A step-by-step walkthrough of the first iteration (<math>t = 1</math>).
Problem setup
Initial weights (<math>X_0</math>):
<math>X_0 = \begin{bmatrix} 0.7 & -0.5 & 0.9 \\ -1.1 & 0.8 & -1.6 \\ 1.2 & -0.7 & 0.4 \end{bmatrix}</math>
Gradient (<math>G_t</math>):
<math>G_t = \begin{bmatrix} 0.3 & -0.2 & 0.4 \\ -0.5 & 0.6 & -0.1 \\ 0.2 & -0.4 & 0.3 \end{bmatrix}</math>
Hyperparameters setup
<math>\epsilon_1 = 10^{-30}</math> (Regularization constant 1)
<math>\epsilon_2 = 10^{-3}</math> (Regularization constant 2)
<math>d = 1</math> (Clipping threshold)
<math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> (Relative step size)
<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (Second moment decay)
Step 1: Learning Rate Scaling
Define the relative step size:
<math>\rho_t = \min(10^{-2}, 1/\sqrt{1}) = 10^{-2}</math>
Step 1.1: Root Mean Square (RMS) calculation for <math>X_0</math>
RMS formula:
<math>\text{RMS}(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} X_0[i]^2}</math>
Substitute the initial weights:
<math>\text{RMS}(X_0) = \sqrt{\tfrac{1}{9}\left(0.7^2 + (-0.5)^2 + 0.9^2 + (-1.1)^2 + 0.8^2 + (-1.6)^2 + 1.2^2 + (-0.7)^2 + 0.4^2\right)}</math>
<math>\text{RMS}(X_0) = \sqrt{\tfrac{8.05}{9}} \approx 0.946</math>
Find the Learning Rate Scaling (<math>\alpha_t</math>):
<math>\alpha_1 = \max(\epsilon_2, \text{RMS}(X_0))\,\rho_1 = \max(10^{-3}, 0.946) \times 10^{-2} \approx 9.46 \times 10^{-3}</math>
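The Step 1 arithmetic can be checked mechanically; the snippet below (illustrative) recomputes <math>\rho_1</math>, <math>\text{RMS}(X_0)</math>, and <math>\alpha_1</math> from the problem setup.

<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[ 0.7, -0.5,  0.9],
               [-1.1,  0.8, -1.6],
               [ 1.2, -0.7,  0.4]])

t, eps2 = 1, 1e-3
rho_1 = min(1e-2, 1.0 / np.sqrt(t))     # 0.01
rms_X0 = np.sqrt(np.mean(X0 ** 2))      # sqrt(8.05 / 9) ~ 0.946
alpha_1 = max(eps2, rms_X0) * rho_1     # ~ 9.46e-3
print(rho_1, rms_X0, alpha_1)
</syntaxhighlight>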