From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Revision as of 16:55, 10 December 2024
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function <math>f(X)</math>, where <math>X \in \mathbb{R}^n</math> is the weight vector to be optimized.
2. Parameters
- Gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t}) (G_t^2 + \epsilon_1 1_n)</math>
- Where:
<math>\hat{V}_t</math> is the running average of the squared gradient.
<math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
<math>\epsilon_1</math> is a regularization constant.
- Step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
- Where:
<math>\rho_t</math> is the relative step size.
<math>\epsilon_2</math> is a regularization constant.
<math>\text{RMS}</math> is the root mean square, defined as:
<math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>
<math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
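The quantities defined in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `rms`, `adaptive_step_size`, and `update_second_moment` are hypothetical, vectors are represented as Python lists, and the second moment update is assumed to be the elementwise rule <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1)</math>.

```python
import math


def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adaptive_step_size(x_prev, rho_t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(x_prev)) * rho_t (step scales with parameter size)."""
    return max(eps2, rms(x_prev)) * rho_t


def update_second_moment(v_prev, grad, beta2_t, eps1=1e-30):
    """Elementwise running average of the squared gradient (assumed update rule)."""
    return [beta2_t * v + (1.0 - beta2_t) * (g * g + eps1)
            for v, g in zip(v_prev, grad)]


# Hypothetical inputs, chosen only for illustration.
x_prev = [3.0, -4.0]
alpha = adaptive_step_size(x_prev, rho_t=0.01)
# With beta2_t = 0 (the first step), the estimate is just the squared gradient.
v = update_second_moment([0.0, 0.0], [0.5, -2.0], beta2_t=0.0)
```

Because the step size is multiplied by the RMS of the current parameters, large weights take proportionally larger steps, while <math>\epsilon_2</math> keeps the step from collapsing to zero when the weights are close to zero.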
3. Algorithms
Adafactor for Weight Vectors
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^n</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
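- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t}) (G_t^2 + \epsilon_1 1_n)</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for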
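The vector algorithm can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `adafactor_vector` and the callback `grad_fn` are hypothetical, the step-size and decay schedules are the proposed defaults <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> and <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>, and the clipping and parameter-update steps are assumed to mirror those listed for the matrix variant.

```python
import math


def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adafactor_vector(grad_fn, x0, num_steps, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight vector stored as a Python list."""
    x = list(x0)
    v = [0.0] * len(x)  # running average of squared gradients
    for t in range(1, num_steps + 1):
        rho_t = min(1e-2, 1.0 / math.sqrt(t))   # relative step size
        beta2_t = 1.0 - t ** (-0.8)             # decay; equals 0 at t = 1
        alpha_t = max(eps2, rms(x)) * rho_t     # adaptive step size
        g = grad_fn(x)                          # gradient at the current point
        v = [beta2_t * vi + (1.0 - beta2_t) * (gi * gi + eps1)
             for vi, gi in zip(v, g)]
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]  # normalized gradient
        scale = max(1.0, rms(u) / d)            # clip so that RMS(u_hat) <= d
        x = [xi - alpha_t * ui / scale for xi, ui in zip(x, u)]
    return x


def quadratic_grad(x):
    """Gradient of the toy loss f(x) = 0.5 * sum(x_i^2), used only as an example."""
    return list(x)


x_final = adafactor_vector(quadratic_grad, [1.0, -2.0], num_steps=50)
```

On the first step the decay is zero, so the normalized gradient reduces to the sign of the gradient and each coordinate moves by exactly <math>\alpha_1</math>.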
Adafactor for Weight Matrices
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
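- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t}) (G_t^2 + \epsilon_1 1_n 1_m^\top) 1_m</math>
- Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^\top (G_t^2 + \epsilon_1 1_n 1_m^\top)</math>
- Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^\top R_t}</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>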
- End for
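A single iteration of the matrix algorithm can be sketched as below. This is an illustration under assumptions rather than a definitive implementation: the function names `rms_matrix` and `adafactor_matrix_step` are hypothetical, matrices are nested Python lists, the row and column statistics are assumed to be running averages of the row sums and column sums of <math>G_t^2 + \epsilon_1</math>, and the full second moment is assumed to be rebuilt as <math>\hat{V}_{ij} = R_i C_j / \sum_k R_k</math>.

```python
import math


def rms_matrix(mat):
    """Root mean square over all entries of a nested-list matrix."""
    flat = [v for row in mat for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))


def adafactor_matrix_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored update; r has length n (rows), c has length m (columns)."""
    n, m = len(x), len(x[0])
    rho_t = min(1e-2, 1.0 / math.sqrt(t))       # proposed relative step size
    beta2_t = 1.0 - t ** (-0.8)                 # proposed decay; 0 at t = 1
    alpha_t = max(eps2, rms_matrix(x)) * rho_t  # adaptive step size
    sq = [[g[i][j] ** 2 + eps1 for j in range(m)] for i in range(n)]
    # Row-wise and column-wise running averages of the squared gradient.
    r = [beta2_t * r[i] + (1.0 - beta2_t) * sum(sq[i]) for i in range(n)]
    c = [beta2_t * c[j] + (1.0 - beta2_t) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    total = sum(r)
    # Rank-one reconstruction of the second moment, then normalize the gradient.
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms_matrix(u) / d)         # clip so that RMS(u_hat) <= d
    x_new = [[x[i][j] - alpha_t * u[i][j] / scale for j in range(m)]
             for i in range(n)]
    return x_new, r, c


# Hypothetical 2 x 2 example with a rank-one gradient.
x0 = [[1.0, 1.0], [1.0, 1.0]]
g0 = [[1.0, 2.0], [2.0, 4.0]]
x1, r1, c1 = adafactor_matrix_step(x0, g0, r=[0.0, 0.0], c=[0.0, 0.0], t=1)
```

The sketch stores only <math>n + m</math> accumulator values instead of <math>n \times m</math>. The example gradient has rank one, so the factored estimate reproduces the squared gradient exactly; for general gradients it is only an approximation.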
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
- Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
- Clipping threshold: <math>d = 1</math>
- Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
- Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
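The two schedules in this list can be evaluated directly to see how they behave over training. The short sketch below uses hypothetical function names (`relative_step_size`, `second_moment_decay`) and is meant only to illustrate the formulas above.

```python
import math


def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t)): constant at first, then decaying."""
    return min(1e-2, 1.0 / math.sqrt(t))


def second_moment_decay(t):
    """beta2_t = 1 - t^(-0.8): starts at 0 and approaches 1 as t grows."""
    return 1.0 - t ** (-0.8)


# Print both schedules at a few step counts.
for t in (1, 10, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t), second_moment_decay(t))
```

The relative step size stays at <math>10^{-2}</math> until <math>t = 10^4</math> and decays as <math>1/\sqrt{t}</math> afterwards, while the decay parameter equals 0 at <math>t = 1</math>, consistent with <math>\hat{\beta}_{21} = 0</math> in the algorithm inputs.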
Numerical Examples
Applications
Conclusion
Reference