Authors: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function $f(X)$, where $X \in \mathbb{R}^n$ is the weight vector to be optimized.
2. Parameters
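- Gradient: $G_t = \nabla f(X_{t-1})$
- Second moment estimate: $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
- Where:
- $\hat{V}_t$ is the running average of the squared gradient.
- $\hat{\beta}_{2t}$ is the corrected decay parameter.
- $\epsilon_1$ is a regularization constant.
- Step size: $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$
- Where:
- $\rho_t$ is the relative step size.
- $\epsilon_2$ is a regularization constant.
- $\text{RMS}$ is the root mean square, defined as: $\text{RMS}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} X_i^2}$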
3. Algorithms
Adafactor for Weight Vectors
Inputs:
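- Initial point: $X_0 \in \mathbb{R}^n$
- Relative step sizes: $\rho_t$ for $t = 1$ to $T$
- Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{21} = 0$
- Regularization constants: $\epsilon_1, \epsilon_2$
- Clipping threshold: $d$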
Algorithm:
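- For $t = 1$ to $T$:
- Compute adaptive step size: $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$
- Compute gradient: $G_t = \nabla f(X_{t-1})$
- Update second moment estimate: $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
- Compute normalized gradient: $U_t = G_t / \sqrt{\hat{V}_t}$
- Apply clipping: $\hat{U}_t = U_t / \max(1, \text{RMS}(U_t) / d)$
- Update parameter: $X_t = X_{t-1} - \alpha_t \hat{U}_t$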
- End for
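The vector algorithm above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `rms`, `adafactor_vector`, and `quad_grad` are hypothetical, the test problem $f(X) = \sum_i X_i^2$ is chosen only for demonstration, and the schedules $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$ are assumed from the hyperparameters proposed in the Adafactor paper.

```python
import math

def rms(values):
    """Root mean square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def adafactor_vector(grad_fn, x0, num_steps, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight vector (no factoring, no momentum)."""
    x = list(x0)
    v = [0.0] * len(x)                       # second moment estimate V_hat_0
    for t in range(1, num_steps + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))  # relative step size rho_t (assumed schedule)
        beta2 = 1.0 - t ** (-0.8)            # decay beta_hat_2t; equals 0 at t = 1
        alpha = max(eps2, rms(x)) * rho      # adaptive step size alpha_t
        g = grad_fn(x)                       # gradient G_t
        v = [beta2 * vi + (1.0 - beta2) * (gi * gi + eps1)
             for vi, gi in zip(v, g)]        # second moment update
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]  # normalized gradient U_t
        scale = max(1.0, rms(u) / d)         # clipping denominator
        x = [xi - alpha * ui / scale for xi, ui in zip(x, u)]
    return x

# Hypothetical test problem: f(X) = sum(X_i^2), so the gradient is 2X
def quad_grad(x):
    return [2.0 * xi for xi in x]

x_after_one = adafactor_vector(quad_grad, [1.0, -2.0, 3.0], num_steps=1)
x_final = adafactor_vector(quad_grad, [1.0, -2.0, 3.0], num_steps=200)
```

In the first step $\hat{\beta}_{21} = 0$, so $\hat{V}_1 \approx G_1^2$ and every entry of $U_1$ has magnitude 1; each weight therefore moves by exactly $\alpha_1$ in the direction opposite to its gradient.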
Adafactor for Weight Matrices
Inputs:
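- Initial point: $X_0 \in \mathbb{R}^{n \times m}$
- Relative step sizes: $\rho_t$ for $t = 1$ to $T$
- Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{21} = 0$
- Regularization constants: $\epsilon_1, \epsilon_2$
- Clipping threshold: $d$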
Algorithm:
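- For $t = 1$ to $T$:
- Compute adaptive step size: $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$
- Compute gradient: $G_t = \nabla f(X_{t-1})$
- Update row-wise second moment: $R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^\top) 1_m$
- Update column-wise second moment: $C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^\top (G_t^2 + \epsilon_1 1_n 1_m^\top)$
- Update overall second moment estimate: $\hat{V}_t = R_t C_t / (1_n^\top R_t)$
- Compute normalized gradient: $U_t = G_t / \sqrt{\hat{V}_t}$
- Apply clipping: $\hat{U}_t = U_t / \max(1, \text{RMS}(U_t) / d)$
- Update parameter: $X_t = X_{t-1} - \alpha_t \hat{U}_t$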
- End for
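A single iteration of the matrix algorithm can be sketched as follows. This is an illustrative sketch under stated assumptions: `rms` and `adafactor_matrix_step` are hypothetical names, matrices are plain lists of lists, the row and column statistics are kept as sums (matching the multiplication by the all-ones vectors $1_m$ and $1_n^\top$), and the step size and decay schedules are assumed to be the ones proposed in the Adafactor paper.

```python
import math

def rms(matrix):
    """Root mean square over all entries of a list-of-lists matrix."""
    n, m = len(matrix), len(matrix[0])
    return math.sqrt(sum(v * v for row in matrix for v in row) / (n * m))

def adafactor_matrix_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor step; returns the updated (x, r, c)."""
    n, m = len(x), len(x[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))      # relative step size (assumed schedule)
    beta2 = 1.0 - t ** (-0.8)                # second moment decay; 0 at t = 1
    alpha = max(eps2, rms(x)) * rho          # adaptive step size
    sq = [[gij * gij + eps1 for gij in row] for row in g]   # G_t^2 + eps1
    # Row sums (length n) and column sums (length m) of the regularized squares
    r = [beta2 * r[i] + (1.0 - beta2) * sum(sq[i]) for i in range(n)]
    c = [beta2 * c[j] + (1.0 - beta2) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    total = sum(r)                           # 1_n^T R_t
    # V_hat[i][j] = r[i] * c[j] / total, and U = G / sqrt(V_hat)
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms(u) / d)             # update clipping
    x = [[x[i][j] - alpha * u[i][j] / scale for j in range(m)] for i in range(n)]
    return x, r, c

# Hypothetical 2 x 2 example with an all-ones gradient
x0 = [[1.0, 2.0], [3.0, 4.0]]
g1 = [[1.0, 1.0], [1.0, 1.0]]
x1, r1, c1 = adafactor_matrix_step(x0, g1, r=[0.0, 0.0], c=[0.0, 0.0], t=1)
```

Only the $n$ row statistics and $m$ column statistics are carried between iterations, rather than a full $n \times m$ second moment matrix, which is the source of Adafactor's memory savings.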
4. Proposed Hyperparameters for Adafactor
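- Regularization constant 1: $\epsilon_1 = 10^{-30}$
- Regularization constant 2: $\epsilon_2 = 10^{-3}$
- Clipping threshold: $d = 1$
- Relative step size: $\rho_t = \min(10^{-2}, 1/\sqrt{t})$
- Second moment decay: $\hat{\beta}_{2t} = 1 - t^{-0.8}$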
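To see how the two time-dependent hyperparameters behave, the short sketch below evaluates them at a few iteration counts. It assumes the schedules proposed in the Adafactor paper, $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$; the function names are hypothetical.

```python
import math

def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t)): constant at first, then decaying."""
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t):
    """beta_hat_2t = 1 - t^(-0.8): 0 at t = 1, approaching 1 as t grows."""
    return 1.0 - t ** (-0.8)

for t in (1, 10, 100, 10000, 1000000):
    print(t, relative_step_size(t), second_moment_decay(t))
```

The relative step size stays at $10^{-2}$ for the first $10^4$ iterations and only then starts to decay, while the decay rate starts at 0 (so the first second moment estimate is just the squared gradient) and increases toward 1.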
Numerical Examples
Step-by-step instructions for determining the result of the first iteration.
Problem setup
Initial weights ($X_0$):
Initial gradient ($G_1$):
Gradient of the loss function with respect to X
Hyperparameters setup
$\epsilon_2$ (Minimum learning rate scaling factor)
$\epsilon_1$ (Regularization constant)
$d$ (Clipping threshold)
$\rho_t$ (Relative step size)
$\hat{\beta}_{2t}$ (Second moment decay)
Step 1: Learning Rate Scaling
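Define the relative step size $\rho_t$
Step 1.1: Root Mean Square (RMS) calculation for $X_0$
RMS formula: $\text{RMS}(X_0) = \sqrt{\frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} (X_0)_{ij}^2}$
Substitute the initial weights
Step 1.2: Find the Learning Rate Scaling ($\alpha_1$):
Learning rate formula: $\alpha_1 = \max(\epsilon_2, \text{RMS}(X_0)) \rho_1$
Substitute the RMS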
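Step 1 can be checked numerically with a short sketch. The weights below are hypothetical values chosen so the RMS is easy to verify by hand (they are not the weights of this example), and $\rho_1 = 10^{-2}$ and $\epsilon_2 = 10^{-3}$ are assumed from the hyperparameters proposed in the Adafactor paper; the function names are illustrative.

```python
import math

def rms(matrix):
    """Root mean square over all entries of a list-of-lists matrix."""
    count = sum(len(row) for row in matrix)
    return math.sqrt(sum(v * v for row in matrix for v in row) / count)

def step_size(x_prev, rho, eps2=1e-3):
    """alpha_t = max(eps2, RMS(X_{t-1})) * rho_t."""
    return max(eps2, rms(x_prev)) * rho

# Hypothetical weights: sum of squares = 9 + 16 = 25 over four entries,
# so RMS = sqrt(25 / 4) = 2.5
x0 = [[3.0, 4.0], [0.0, 0.0]]
rms_x0 = rms(x0)
alpha_1 = step_size(x0, rho=1e-2)

# With all-zero weights the RMS is 0, so eps2 sets the floor on the scale
alpha_floor = step_size([[0.0, 0.0], [0.0, 0.0]], rho=1e-2)
```

The floor $\epsilon_2$ matters only when the weights are very close to zero; otherwise the step size is proportional to the scale of the weights themselves.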
Step 2: Compute $G_1^2$ (Element-wise Square of Gradient)
Compute the squared value of each element in the gradient matrix $G_1$: $(G_1^2)_{ij} = (G_1)_{ij}^2$.
Step 3: Find the moment estimate
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.
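Step 3.1: Compute row moments ($R_t$)
The following equation computes the row-wise second moments ($R_t$) as an exponential moving average of past moments ($R_{t-1}$) and the current row-wise mean of squared gradients ($G_t^2$), with the balance controlled by $\hat{\beta}_{2t}$.
For each row $i$: $(R_t)_i = \hat{\beta}_{2t} (R_{t-1})_i + (1 - \hat{\beta}_{2t}) \left( \frac{1}{m} \sum_{j=1}^{m} (G_t^2)_{ij} + \epsilon_1 \right)$
Since $\hat{\beta}_{2t} = 1 - t^{-0.8}$, for the first iteration $\hat{\beta}_{21} = 0$. Because $\epsilon_1$ is negligibly small, we can ignore it. The update of $R_1$ is:
Row-wise mean ($R_1$): $(R_1)_i = \frac{1}{m} \sum_{j=1}^{m} (G_1^2)_{ij}$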
Step 3.2: Compute column moments ($C_t$)
The process is the same as for the row moments.
Column-wise mean ($C_1$): $(C_1)_j = \frac{1}{n} \sum_{i=1}^{n} (G_1^2)_{ij}$
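Step 3.3: Second Moment Estimate ($\hat{V}_1$)
The second moment estimate is the outer product of the row moments ($R_1$) and column moments ($C_1$), divided by the mean of the row moments so that the estimate has the same scale as the squared gradient: $(\hat{V}_1)_{ij} = \frac{(R_1)_i (C_1)_j}{\text{mean}(R_1)}$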
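Steps 2 and 3 can be sketched together for the first iteration, where $\hat{\beta}_{21} = 0$ and $\epsilon_1$ is ignored. The gradient below is a hypothetical matrix chosen for easy hand-checking, not the gradient of this example, and `factored_second_moment` is an illustrative name. The sketch assumes the outer product is normalized by the mean of the row moments, which is the mean-based equivalent of the $R_t C_t / (1_n^\top R_t)$ form used when row and column sums are stored.

```python
def factored_second_moment(g):
    """Row means, column means, and factored estimate of G^2 (first iteration).

    Assumes g has at least one nonzero entry, since eps1 is omitted here.
    """
    n, m = len(g), len(g[0])
    sq = [[v * v for v in row] for row in g]                      # Step 2: G^2
    r = [sum(sq[i]) / m for i in range(n)]                        # Step 3.1: row means
    c = [sum(sq[i][j] for i in range(n)) / n for j in range(m)]   # Step 3.2: column means
    mean_r = sum(r) / n                                           # normalizer
    v_hat = [[r[i] * c[j] / mean_r for j in range(m)] for i in range(n)]  # Step 3.3
    return r, c, v_hat

# Hypothetical gradient; its element-wise square is [[1, 4], [9, 16]]
g1 = [[1.0, 2.0], [3.0, 4.0]]
r1, c1, v1 = factored_second_moment(g1)
# r1 = [2.5, 12.5] and c1 = [5.0, 10.0]
```

With this normalization the row sums and column sums of $\hat{V}_1$ equal those of $G_1^2$ (here 5 and 25 for the rows, 10 and 20 for the columns), even though the individual entries differ.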
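Step 4: Update the vector ($U_1$)
Step 4.1: Find the value of the update vector $U_1$
Formula of $U_1$: $U_1 = \frac{G_1}{\sqrt{\hat{V}_1}}$ (element-wise)
Substitute $G_1$ and $\hat{V}_1$
Step 4.2: Clipped Update Vector ($\hat{U}_1$)
Formula of $\hat{U}_1$: $\hat{U}_1 = \frac{U_1}{\max(1, \text{RMS}(U_1)/d)}$
Compute RMS of $U_1$
Since $\text{RMS}(U_1) > d$, scale $U_1$ by $d / \text{RMS}(U_1)$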
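The clipping rule in Step 4.2 can be illustrated with a small sketch. The update matrices below are hypothetical values (not the ones from this example), $d = 1$ is assumed, and the function names are illustrative.

```python
import math

def rms(matrix):
    """Root mean square over all entries of a list-of-lists matrix."""
    count = sum(len(row) for row in matrix)
    return math.sqrt(sum(v * v for row in matrix for v in row) / count)

def clip_update(u, d=1.0):
    """U_hat = U / max(1, RMS(U) / d): rescale only when RMS(U) exceeds d."""
    scale = max(1.0, rms(u) / d)
    return [[v / scale for v in row] for row in u]

# Hypothetical update with RMS = 2.5 > d, so it is scaled by d / RMS = 0.4
u1 = [[3.0, 4.0], [0.0, 0.0]]
u1_clipped = clip_update(u1)

# Hypothetical update with RMS = 0.5 <= d, so it is left unchanged
u_small = [[0.5, -0.5], [0.5, -0.5]]
u_small_clipped = clip_update(u_small)
```

After clipping, the RMS of the update never exceeds $d$, which bounds the size of each step relative to $\alpha_t$.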
Step 5: Weight Update ($X_1$)
$X_1 = X_0 - \alpha_1 \hat{U}_1$
The result for the first iteration
Applications
Conclusion
Reference