From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function <math>f(x)</math>, where <math>f: \mathbb{R}^n \to \mathbb{R}</math> and <math>x \in \mathbb{R}^n</math> is the weight vector to be optimized.
2. Parameters
- Gradient: <math>G_t = \nabla f(x_{t-1})</math>
- Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
  - Where:
    - <math>\hat{V}_t</math> is the running average of the squared gradient.
    - <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
    - <math>\epsilon_1</math> is a regularization constant.
- Step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1}))\, \rho_t</math>
  - Where:
    - <math>\rho_t</math> is the relative step size.
    - <math>\epsilon_2</math> is a regularization constant.
    - <math>\text{RMS}</math> is the root mean square of the unscaled update <math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>, defined as:
      - <math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
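To make these definitions concrete, the following is a minimal NumPy sketch of the RMS and step-size computations (the function names rms and step_size are ours, for illustration only):

<syntaxhighlight lang="python">
import numpy as np

def rms(u):
    # Root mean square over all entries of a tensor: sqrt(mean(u^2)).
    return np.sqrt(np.mean(np.square(u)))

def step_size(x_prev, rho_t, eps2=1e-3):
    # alpha_t = max(eps2, RMS(X_{t-1})) * rho_t: the update is scaled
    # relative to the magnitude of the parameters themselves, with
    # eps2 as a floor for parameters that start near zero.
    return max(eps2, rms(x_prev)) * rho_t
</syntaxhighlight>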
3. Problem Formulation
Adafactor for Weighted Vectors
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^n</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
  - Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\, \rho_t</math>
  - Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
  - Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
  - Compute normalized gradient: <math>U_t = G_t / \sqrt{\hat{V}_t}</math>
  - Apply clipping: <math>\hat{U}_t = U_t / \max(1, \text{RMS}(U_t)/d)</math>
  - Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
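The loop above maps almost line for line onto NumPy. The sketch below is illustrative rather than a reference implementation; the names adafactor_vector and grad_fn are ours, and grad_fn(x) is assumed to return the gradient <math>\nabla f_t(x)</math>:

<syntaxhighlight lang="python">
import numpy as np

def adafactor_vector(x0, grad_fn, T, rho, beta2hat,
                     eps1=1e-30, eps2=1e-3, d=1.0):
    # Adafactor for a weight vector x in R^n.  rho(t) and beta2hat(t)
    # are the schedules rho_t and beta_hat_2t (1-indexed); beta2hat(1)
    # should be 0 so that V_hat_1 depends only on the first gradient.
    x = np.asarray(x0, dtype=float).copy()
    v_hat = np.zeros_like(x)                                # V_hat_0 = 0
    for t in range(1, T + 1):
        alpha = max(eps2, np.sqrt(np.mean(x**2))) * rho(t)  # step size
        g = grad_fn(x)                                      # gradient G_t
        b2 = beta2hat(t)
        v_hat = b2 * v_hat + (1 - b2) * (g**2 + eps1)       # second moment
        u = g / np.sqrt(v_hat)                              # normalized gradient
        u = u / max(1.0, np.sqrt(np.mean(u**2)) / d)        # clipping
        x = x - alpha * u                                   # parameter update
    return x
</syntaxhighlight>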
Adafactor for Weighted Matrices
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
  - Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1}))\, \rho_t</math>
  - Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
  - Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
  - Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
  - Update overall second moment estimate: <math>\hat{V}_t = R_t C_t / (1_n^T R_t)</math>
  - Compute normalized gradient: <math>U_t = G_t / \sqrt{\hat{V}_t}</math>
  - Apply clipping: <math>\hat{U}_t = U_t / \max(1, \text{RMS}(U_t)/d)</math>
  - Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
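A sketch of the factored variant follows, again with our own naming. The point to notice is the memory saving: only the row statistics <math>R_t</math> (<math>n</math> numbers) and column statistics <math>C_t</math> (<math>m</math> numbers) persist between iterations, instead of the full <math>n \times m</math> second-moment matrix:

<syntaxhighlight lang="python">
import numpy as np

def adafactor_matrix(x0, grad_fn, T, rho, beta2hat,
                     eps1=1e-30, eps2=1e-3, d=1.0):
    # Adafactor for a weight matrix X in R^{n x m}, with a factored
    # (rank-1) estimate of the second moment.
    x = np.asarray(x0, dtype=float).copy()
    n, m = x.shape
    r = np.zeros(n)                                      # row moments R_t
    c = np.zeros(m)                                      # column moments C_t
    for t in range(1, T + 1):
        alpha = max(eps2, np.sqrt(np.mean(x**2))) * rho(t)
        g = grad_fn(x)
        b2 = beta2hat(t)
        sq = g**2 + eps1                                 # G_t^2 + eps1*1_n 1_m^T
        r = b2 * r + (1 - b2) * sq.sum(axis=1)           # row-wise update
        c = b2 * c + (1 - b2) * sq.sum(axis=0)           # column-wise update
        v_hat = np.outer(r, c) / r.sum()                 # V_hat = R C / (1_n^T R)
        u = g / np.sqrt(v_hat)
        u = u / max(1.0, np.sqrt(np.mean(u**2)) / d)
        x = x - alpha * u
    return x
</syntaxhighlight>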
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
- Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
- Clipping threshold: <math>d = 1</math>
- Relative step size: <math>\rho_t = \min(10^{-2},\ 1/\sqrt{t})</math>
- Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
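These schedules drop directly into the sketches above as plain Python functions. As a toy check (our example, not from the article), minimizing the quadratic <math>f(x) = \tfrac{1}{2}\lVert x - b\rVert^2</math> with the vector variant drives <math>x</math> toward <math>b</math>:

<syntaxhighlight lang="python">
import numpy as np

rho = lambda t: min(1e-2, 1.0 / np.sqrt(t))   # relative step size rho_t
beta2hat = lambda t: 1.0 - t**(-0.8)          # decay schedule; 0 at t = 1

# Toy objective f(x) = 0.5 * ||x - b||^2, whose gradient is x - b.
b = np.array([1.0, -2.0, 3.0])
x = adafactor_vector(np.zeros(3), lambda x: x - b,
                     T=5000, rho=rho, beta2hat=beta2hat)
print(x)  # close to b, up to the relative-step-size floor
</syntaxhighlight>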
Numerical Examples
Applications
Conclusion
Reference