Adafactor

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
== Introduction ==

== Problem formulation ==
=== 1. Objective ===
Minimize the loss function <math>f(x)</math>, where <math>x \in \mathbb{R}^n</math> is the weight vector to be optimized.
=== 2. Parameters ===
* '''Gradient:'''
*: <math>G_t = \nabla f(x_{t-1})</math>
* '''Second moment estimate:'''
*: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
*: where:
** <math>\hat{V}_t</math> is the running average of the squared gradient,
** <math>\hat{\beta}_{2t}</math> is the corrected decay parameter, and
** <math>\epsilon_1</math> is a regularization constant.
* '''Step size:'''
*: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \, \rho_t</math>
*: where:
** <math>\rho_t</math> is the relative step size,
** <math>\epsilon_2</math> is a regularization constant, and
** <math>\text{RMS}</math> is the root mean square, taken over the components of the unscaled update <math>u_{xt}</math> and defined as follows (a short numerical sketch follows this list):
*: <math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>
*: <math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
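To make these definitions concrete, here is a small NumPy sketch of the RMS and adaptive step size computations. It is illustrative only; the function names <code>rms</code> and <code>step_size</code> and the example vector are assumptions, not part of the original formulation.

<syntaxhighlight lang="python">
# Small numerical sketch of the RMS and step-size formulas above; the function
# names and example values are illustrative assumptions, not from the article.
import numpy as np

def rms(x):
    """Root mean square of the entries of an array."""
    return np.sqrt(np.mean(np.square(x)))

def step_size(x_prev, rho_t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(x_{t-1})) * rho_t."""
    return max(eps2, rms(x_prev)) * rho_t

x_prev = np.array([0.5, -1.0, 2.0])
print(rms(x_prev))                     # ~1.323
print(step_size(x_prev, rho_t=1e-2))   # ~0.0132
</syntaxhighlight>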
=== 3. Algorithms ===
==== Adafactor for Weight Vectors ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^n</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1, \epsilon_2</math>
* Clipping threshold: <math>d</math>
'''Algorithm''' (a Python sketch follows the pseudocode):
# For <math>t = 1</math> to <math>T</math>:
## Compute the adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \, \rho_t</math>
## Compute the gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
## Update the second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
## Compute the normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
## Apply update clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
## Update the parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
# End for
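The pseudocode above can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' reference code: <code>grad_fn</code> is an assumed stand-in for the gradient oracle <math>\nabla f_t</math>, and the schedules for <math>\rho_t</math> and <math>\hat{\beta}_{2t}</math> are hard-coded to the values proposed in ''4. Proposed Hyperparameters for Adafactor'' below.

<syntaxhighlight lang="python">
# Illustrative NumPy sketch of Adafactor for weight vectors (not the authors'
# reference implementation). `grad_fn` is an assumed stand-in for the gradient
# oracle returning G_t = grad f_t(x_{t-1}).
import numpy as np

def rms(x):
    """Root mean square of the entries of an array."""
    return np.sqrt(np.mean(np.square(x)))

def adafactor_vector(x0, grad_fn, T, eps1=1e-30, eps2=1e-3, d=1.0):
    x = np.array(x0, dtype=float)
    v_hat = np.zeros_like(x)                      # running second-moment estimate V_hat_t
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size
        beta2_t = 1.0 - t ** (-0.8)               # decay parameter (equals 0 at t = 1)
        alpha_t = max(eps2, rms(x)) * rho_t       # adaptive step size
        g = grad_fn(x, t)                         # gradient G_t
        v_hat = beta2_t * v_hat + (1.0 - beta2_t) * (g ** 2 + eps1)
        u = g / np.sqrt(v_hat)                    # normalized gradient U_t
        u_hat = u / max(1.0, rms(u) / d)          # update clipping
        x = x - alpha_t * u_hat                   # parameter update
    return x

# Usage: 200 Adafactor steps on the quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = adafactor_vector(np.array([1.0, -2.0, 3.0]), lambda x, t: x, T=200)
print(x_final)
</syntaxhighlight>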
==== Adafactor for Weight Matrices ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1, \epsilon_2</math>
* Clipping threshold: <math>d</math>
'''Algorithm''' (a Python sketch follows the pseudocode):
# For <math>t = 1</math> to <math>T</math>:
## Compute the adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \, \rho_t</math>
## Compute the gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
## Update the row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
## Update the column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
## Reconstruct the second moment estimate from its factors: <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
## Compute the normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
## Apply update clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
## Update the parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
# End for
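A matching NumPy sketch of the factored matrix variant is given below, under the same caveats (illustrative only, hard-coded schedules). The key difference from the vector case is that only the row factor <math>R_t</math> and column factor <math>C_t</math> are stored, so the memory for the second moment drops from <math>O(nm)</math> to <math>O(n + m)</math>.

<syntaxhighlight lang="python">
# Illustrative NumPy sketch of the factored (matrix) variant of Adafactor
# (again an assumption, not the authors' code). Only the row factor R_t and
# the column factor C_t are stored; V_hat_t is rebuilt as their outer product
# divided by the total mass 1_n^T R_t.
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def adafactor_matrix(X0, grad_fn, T, eps1=1e-30, eps2=1e-3, d=1.0):
    X = np.array(X0, dtype=float)                 # X in R^{n x m}
    R = np.zeros(X.shape[0])                      # row factor, length n
    C = np.zeros(X.shape[1])                      # column factor, length m
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))
        beta2_t = 1.0 - t ** (-0.8)
        alpha_t = max(eps2, rms(X)) * rho_t
        G = grad_fn(X, t)
        sq = G ** 2 + eps1                                    # G_t^2 + eps1 * 1_n 1_m^T
        R = beta2_t * R + (1.0 - beta2_t) * sq.sum(axis=1)    # row sums
        C = beta2_t * C + (1.0 - beta2_t) * sq.sum(axis=0)    # column sums
        V_hat = np.outer(R, C) / R.sum()                      # R_t C_t / (1_n^T R_t)
        U = G / np.sqrt(V_hat)                                # normalized gradient
        U_hat = U / max(1.0, rms(U) / d)                      # update clipping
        X = X - alpha_t * U_hat
    return X

# Usage: 200 steps on f(X) = 0.5 * ||X||_F^2, whose gradient is X.
X_final = adafactor_matrix(np.ones((3, 4)), lambda X, t: X, T=200)
print(X_final)
</syntaxhighlight>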
=== 4. Proposed Hyperparameters for Adafactor ===
* Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
* Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
* Clipping threshold: <math>d = 1</math>
* Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
* Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
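For illustration, the two schedules can be written as small Python functions (the names are arbitrary; the defaults match the constants listed above):

<syntaxhighlight lang="python">
# The two schedules above written as plain Python functions (illustrative
# names; values match the constants listed in this section).
def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t))."""
    return min(1e-2, t ** -0.5)

def second_moment_decay(t):
    """beta_hat_2t = 1 - t^(-0.8): equals 0 at t = 1 and approaches 1 as t grows."""
    return 1.0 - t ** (-0.8)

# rho_t stays at 1e-2 until t = 10**4, after which the 1/sqrt(t) term takes over.
for t in (1, 10, 1000, 100000):
    print(t, relative_step_size(t), second_moment_decay(t))
</syntaxhighlight>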
== Numerical Examples ==