Adafactor
Revision as of 17:59, 10 December 2024
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function $ f(x) $, where $ x \in \mathbb{R}^n $ is the weight vector to be optimized.
2. Parameters
- Gradient:
$ G_t = \nabla f(x_{t-1}) $
- Second moment estimate:
$ \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n) $
- Where:
- $ \hat{V}_t $ is the running average of the squared gradient.
- $ \hat{\beta}_{2t} $ is the corrected decay parameter.
- $ \epsilon_1 $ is a regularization constant.
- Step size:
$ \alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t $
- Where:
- $ \rho_t $ is the relative step size.
- $ \epsilon_2 $ is a regularization constant.
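- $ \text{RMS} $ is the root mean square. For the normalized gradient $ U_t $, whose entries $ u_{xt} $ are indexed by the individual parameters $ x \in X $, it is defined as:
- $ u_{xt} = \frac{g_{xt}}{\sqrt{\hat{v}_{xt}}} $
- $ \text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)} $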
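The RMS definition above can be illustrated with a small sketch. The helper name `rms` and the toy values of `g` and `v` are assumptions made for illustration only. Because the RMS squares each entry, the sign convention chosen for $ u_{xt} $ does not change the result.

```python
import math

def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Toy gradient entries g and second moment estimates v (illustrative values).
g = [3.0, -4.0]
v = [1.0, 4.0]

# Normalized entries u = g / sqrt(v), then their root mean square.
u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
rms_u = rms(u)

# Equivalent form from the definition: sqrt(Mean(g^2 / v)).
rms_u_direct = math.sqrt(sum(gi * gi / vi for gi, vi in zip(g, v)) / len(g))
```

Both `rms_u` and `rms_u_direct` evaluate the same quantity, which is the value compared against the clipping threshold $ d $.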
3. Algorithms
Adafactor for Weight Vectors
Inputs:
- Initial point: $ X_0 \in \mathbb{R}^n $
- Relative step sizes: $ \rho_t $ for $ t = 1 $ to $ T $
- Second moment decay: $ \hat{\beta}_{2t} $ for $ t = 1 $ to $ T $, with $ \hat{\beta}_{21} = 0 $ (the decay at $ t = 1 $)
- Regularization constants: $ \epsilon_1, \epsilon_2 $
- Clipping threshold: $ d $
Algorithm:
- For $ t = 1 $ to $ T $:
- Compute adaptive step size: $ \alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t $
- Compute gradient: $ G_t = \nabla f_t(X_{t-1}) $
- Update second moment estimate: $ \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n) $
- Compute normalized gradient: $ U_t = \frac{G_t}{\sqrt{\hat{V}_t}} $
- Apply clipping: $ \hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)} $
- Update parameter: $ X_t = X_{t-1} - \alpha_t \hat{U}_t $
- End for
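The vector loop above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names `adafactor_vector` and `grad_quadratic`, the toy quadratic loss, and the choice to initialize $ \hat{V}_0 $ to zero are assumptions. The schedules for $ \rho_t $ and $ \hat{\beta}_{2t} $ are the ones listed under the proposed hyperparameters on this page.

```python
import math

def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def adafactor_vector(grad_fn, x0, num_steps, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run num_steps Adafactor updates on a weight vector given as a list."""
    x = list(x0)
    v_hat = [0.0] * len(x)  # assumed initial value; it is discarded at t = 1
    for t in range(1, num_steps + 1):
        rho_t = min(1e-2, 1.0 / math.sqrt(t))   # relative step size
        beta2_t = 1.0 - t ** (-0.8)             # second moment decay, 0 at t = 1
        alpha_t = max(eps2, rms(x)) * rho_t     # adaptive step size
        g = grad_fn(x)                          # gradient at X_{t-1}
        v_hat = [beta2_t * vi + (1.0 - beta2_t) * (gi * gi + eps1)
                 for vi, gi in zip(v_hat, g)]   # second moment estimate
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v_hat)]  # normalized gradient
        scale = max(1.0, rms(u) / d)            # update clipping
        x = [xi - alpha_t * ui / scale for xi, ui in zip(x, u)]
    return x

def grad_quadratic(x):
    # Gradient of the toy loss f(x) = sum(x_i^2), a hypothetical objective.
    return [2.0 * xi for xi in x]

x0 = [1.0, -2.0]
x_one_step = adafactor_vector(grad_quadratic, x0, num_steps=1)
x_final = adafactor_vector(grad_quadratic, x0, num_steps=50)
```

At $ t = 1 $ the decay is zero, so $ \hat{V}_1 $ equals the squared gradient plus $ \epsilon_1 $ and every entry of $ U_1 $ has magnitude close to 1. The first update therefore moves each weight by roughly $ \alpha_1 $ in the direction opposite to its gradient.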
Adafactor for Weight Matrices
Inputs:
- Initial point: $ X_0 \in \mathbb{R}^{n \times m} $
- Relative step sizes: $ \rho_t $ for $ t = 1 $ to $ T $
- Second moment decay: $ \hat{\beta}_{2t} $ for $ t = 1 $ to $ T $, with $ \hat{\beta}_{21} = 0 $ (the decay at $ t = 1 $)
- Regularization constants: $ \epsilon_1, \epsilon_2 $
- Clipping threshold: $ d $
Algorithm:
- For $ t = 1 $ to $ T $:
- Compute adaptive step size:
$ \alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t $
- Compute gradient:
$ G_t = \nabla f_t(X_{t-1}) $
- Update row-wise second moment:
$ R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m $
- Update column-wise second moment:
$ C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T) $
- Update overall second moment estimate:
$ \hat{V}_t = \frac{R_t C_t}{1_n^T R_t} $
- Compute normalized gradient:
$ U_t = \frac{G_t}{\sqrt{\hat{V}_t}} $
- Apply clipping:
$ \hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)} $
- Update parameter:
$ X_t = X_{t-1} - \alpha_t \hat{U}_t $
- End for
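A single step of the matrix variant can be sketched in the same way. The point of the factorization is that only the row accumulator $ R_t $ (length $ n $) and the column accumulator $ C_t $ (length $ m $) are stored between steps, instead of a full $ n \times m $ second moment matrix. The function name `adafactor_matrix_step`, the list-of-rows representation, and the toy inputs are assumptions made for illustration.

```python
import math

def rms_matrix(mat):
    """Root mean square over all entries of a matrix stored as a list of rows."""
    flat = [v for row in mat for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))

def adafactor_matrix_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for an n-by-m weight matrix x with gradient g.

    r is the length-n row accumulator and c the length-m column accumulator.
    Returns the updated (x, r, c).
    """
    n, m = len(x), len(x[0])
    rho_t = min(1e-2, 1.0 / math.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    alpha_t = max(eps2, rms_matrix(x)) * rho_t
    # Squared gradient plus eps1 in every entry.
    sq = [[g[i][j] ** 2 + eps1 for j in range(m)] for i in range(n)]
    # Row sums update R_t, column sums update C_t.
    r = [beta2_t * r[i] + (1.0 - beta2_t) * sum(sq[i]) for i in range(n)]
    c = [beta2_t * c[j] + (1.0 - beta2_t) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    # Rank-1 reconstruction: V_hat[i][j] = R[i] * C[j] / sum(R).
    total = sum(r)
    v_hat = [[r[i] * c[j] / total for j in range(m)] for i in range(n)]
    u = [[g[i][j] / math.sqrt(v_hat[i][j]) for j in range(m)] for i in range(n)]
    scale = max(1.0, rms_matrix(u) / d)
    x = [[x[i][j] - alpha_t * u[i][j] / scale for j in range(m)] for i in range(n)]
    return x, r, c

# Toy example: the squared gradient is exactly rank 1, so the factored
# estimate reproduces it without error at t = 1.
x0 = [[1.0, 1.0], [1.0, 1.0]]
g0 = [[1.0, 2.0], [2.0, 4.0]]
x_new, r_new, c_new = adafactor_matrix_step(x0, g0, [0.0, 0.0], [0.0, 0.0], t=1)
```

In this toy case the reconstruction is exact because the squared gradient has rank 1. For a general gradient the product $ R_t C_t / (1_n^T R_t) $ is only an approximation of the full second moment matrix.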
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: $ \epsilon_1 = 10^{-30} $
- Regularization constant 2: $ \epsilon_2 = 10^{-3} $
- Clipping threshold: $ d = 1 $
- Relative step size: $ \rho_t = \min(10^{-2}, 1/\sqrt{t}) $
- Second moment decay: $ \hat{\beta}_{2t} = 1 - t^{-0.8} $
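To show how the two schedules above behave over time, the sketch below evaluates them at a few step counts. The function names `relative_step_size` and `second_moment_decay` are hypothetical; the formulas are the ones listed above.

```python
import math

def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t)) for step t >= 1."""
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t):
    """beta2_t = 1 - t^(-0.8) for step t >= 1."""
    return 1.0 - t ** (-0.8)

# Evaluate both schedules at a few representative steps.
schedule = {t: (relative_step_size(t), second_moment_decay(t))
            for t in (1, 100, 10_000, 1_000_000)}
for t, (rho, beta2) in schedule.items():
    print(t, rho, beta2)
```

The relative step size stays at $ 10^{-2} $ until $ t = 10^4 $ and decays as $ 1/\sqrt{t} $ afterwards. The decay starts at 0 for $ t = 1 $, so the first second moment estimate ignores the initial value, and it approaches 1 as $ t $ grows.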