Adafactor
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function <math>f(X)</math>, where <math>X</math> is the vector (or matrix) of weights to be optimized.
2. Parameters
- Gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})\left(G_t^2 + \epsilon_1\right)</math>
- Where:
- <math>\hat{V}_t</math> is the running average of the squared gradient.
- <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
- <math>\epsilon_1</math> is a regularization constant.
- Step size: <math>\alpha_t = \max\left(\epsilon_2, \text{RMS}(X_{t-1})\right) \rho_t</math>
- Where:
- <math>\rho_t</math> is the relative step size.
- <math>\epsilon_2</math> is a regularization constant.
- <math>\text{RMS}</math> is the root mean square, defined as: <math>\text{RMS}(X) = \sqrt{\operatorname{Mean}_{x \in X}\left(x^2\right)}</math>
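To make these quantities concrete, the short sketch below computes <math>\text{RMS}(X_{t-1})</math> and the adaptive step size <math>\alpha_t</math> for a small made-up parameter vector, using the hyperparameter values proposed in Section 4. This is an illustrative example, not code from the original paper.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative (made-up) parameter vector and iteration counter
X_prev = np.array([0.02, -0.15, 0.07, 0.30])
t = 100

eps2 = 1e-3                           # regularization constant epsilon_2
rho_t = min(1e-2, 1.0 / np.sqrt(t))   # relative step size rho_t = min(10^-2, 1/sqrt(t))

rms_x = np.sqrt(np.mean(X_prev ** 2))   # RMS(X_{t-1}): root mean square of the entries
alpha_t = max(eps2, rms_x) * rho_t      # alpha_t = max(eps2, RMS(X_{t-1})) * rho_t

print(rms_x, alpha_t)   # roughly 0.17 and 1.7e-3 for these values
</syntaxhighlight>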
3. Algorithms
Adafactor for Weighted Vectors
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^n</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max\left(\epsilon_2, \text{RMS}(X_{t-1})\right) \rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})\left(G_t^2 + \epsilon_1 1_n\right)</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max\left(1, \text{RMS}(U_t)/d\right)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
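The loop above can be condensed into a few lines of NumPy. The sketch below is a simplified, illustrative reading of the pseudocode rather than the authors' reference implementation; the function name adafactor_vector_step and the toy inputs are invented for this example.

<syntaxhighlight lang="python">
import numpy as np

def adafactor_vector_step(x, v_hat, grad, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector (illustrative sketch)."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))             # relative step size
    beta2_t = 1.0 - t ** (-0.8)                     # second moment decay
    alpha_t = max(eps2, np.sqrt(np.mean(x ** 2))) * rho_t   # adaptive step size

    # Running average of the squared gradient (second moment estimate)
    v_hat = beta2_t * v_hat + (1.0 - beta2_t) * (grad ** 2 + eps1)

    u = grad / np.sqrt(v_hat)                             # normalized gradient
    u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)    # update clipping
    return x - alpha_t * u_hat, v_hat

# Toy usage with made-up numbers
x = np.array([0.1, -0.2, 0.3])
v_hat = np.zeros_like(x)
grad = np.array([0.5, -0.1, 0.05])
x, v_hat = adafactor_vector_step(x, v_hat, grad, t=1)
</syntaxhighlight>

Note that with the proposed decay schedule <math>\hat{\beta}_{21} = 0</math>, so the first update uses the raw squared gradient and initializing the second moment estimate to zeros introduces no bias.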
Adafactor for Weighted Matrices
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max\left(\epsilon_2, \text{RMS}(X_{t-1})\right) \rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})\left(G_t^2 + \epsilon_1 1_n 1_m^\top\right) 1_m</math>
- Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t})\, 1_n^\top \left(G_t^2 + \epsilon_1 1_n 1_m^\top\right)</math>
- Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^\top R_t}</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max\left(1, \text{RMS}(U_t)/d\right)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
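The step that distinguishes the matrix variant is the factored second moment: only the row statistics <math>R_t</math> and column statistics <math>C_t</math> are stored, and <math>\hat{V}_t</math> is rebuilt as a rank-1 outer product. The NumPy sketch below illustrates one such update under that reading; the function name adafactor_matrix_step and the toy values are invented for this example, not taken from the original paper.

<syntaxhighlight lang="python">
import numpy as np

def adafactor_matrix_step(X, R, C, grad, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight matrix with factored second moments."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t

    sq = grad ** 2 + eps1
    R = beta2_t * R + (1.0 - beta2_t) * sq.sum(axis=1)   # row-wise second moment, shape (n,)
    C = beta2_t * C + (1.0 - beta2_t) * sq.sum(axis=0)   # column-wise second moment, shape (m,)

    # Rank-1 reconstruction of the full second moment: V_hat = (R C^T) / sum(R)
    V_hat = np.outer(R, C) / R.sum()

    U = grad / np.sqrt(V_hat)                             # normalized gradient
    U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)    # update clipping
    return X - alpha_t * U_hat, R, C

# Toy usage: a 2x3 weight matrix with made-up values
X = np.array([[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]])
R, C = np.zeros(2), np.zeros(3)
grad = np.full_like(X, 0.05)
X, R, C = adafactor_matrix_step(X, R, C, grad, t=1)
</syntaxhighlight>

Storing <math>R_t</math> and <math>C_t</math> requires only <math>n + m</math> values instead of the <math>n \times m</math> values a full second-moment matrix would need, which is the main source of Adafactor's memory savings.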
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
- Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates, so its value should be very close to zero.
- Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
- Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to <math>\epsilon_1</math>, this relatively larger value keeps updates stable under noise and in low-magnitude scenarios.
- Clipping threshold: <math>d = 1</math>
- A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.
- Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
- <math>\min(10^{-2}, \ldots)</math> caps the learning rate at <math>10^{-2}</math>, an empirically chosen upper bound.
- <math>1/\sqrt{t}</math> promotes convergence by balancing sufficient exploration in early iterations with stability in later iterations.
- Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
- The form <math>1 - \ldots</math> ensures the decay factor remains close to 1 as training progresses.
- The exponent in <math>t^{-0.8}</math> balances rapid adaptation in early training with stable averaging in later iterations.
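The two schedules can be made concrete with a few printed values. The short sketch below (illustrative only) shows how the relative step size is capped at <math>10^{-2}</math> until roughly <math>t = 10^4</math> and then decays as <math>1/\sqrt{t}</math>, while the decay factor <math>\hat{\beta}_{2t}</math> starts at 0 and rises toward 1.

<syntaxhighlight lang="python">
import math

# Proposed schedules: rho_t = min(10^-2, 1/sqrt(t)),  beta2_t = 1 - t^(-0.8)
for t in [1, 100, 10_000, 1_000_000]:
    rho_t = min(1e-2, 1.0 / math.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    print(f"t={t:>9,}  rho_t={rho_t:.2e}  beta2_t={beta2_t:.6f}")

# Expected output (approximate):
# t=        1  rho_t=1.00e-02  beta2_t=0.000000
# t=      100  rho_t=1.00e-02  beta2_t=0.974881
# t=   10,000  rho_t=1.00e-02  beta2_t=0.999369
# t=1,000,000  rho_t=1.00e-03  beta2_t=0.999984
</syntaxhighlight>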