From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Authors: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function $f(X)$, where $X \in \mathbb{R}^n$ is the weight vector to be optimized.
2. Parameters
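- Gradient: $G_t = \nabla f(X_{t-1})$
- Second moment estimate: $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
- Where:
$\hat{V}_t$ is the running average of the squared gradient.
$\hat{\beta}_{2t}$ is the corrected decay parameter.
$\epsilon_1$ is a regularization constant.
$G_t^2$ is the element-wise square of the gradient and $1_n$ is the all-ones vector of length $n$.
- Step size: $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$
- Where:
$\rho_t$ is the relative step size.
$\epsilon_2$ is a regularization constant.
$\text{RMS}$ is the root mean square, defined as:
$\text{RMS}(X_{t-1}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} X_{t-1,i}^2}$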
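The step size combines the relative step size with the root mean square of the current weights, $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$. The following is a minimal Python sketch of that rule, not a reference implementation. The function names `rms` and `step_size` and the sample weights are assumptions introduced for illustration, and the weights are passed as a flat list of floats.

```python
import math


def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def step_size(weights, rho_t, eps2=1e-3):
    """Absolute step size alpha_t = max(eps2, RMS(weights)) * rho_t."""
    return max(eps2, rms(weights)) * rho_t


# Hypothetical weights, chosen so the RMS is easy to check by hand:
# RMS = sqrt((9 + 16 + 0 + 0) / 4) = 2.5
weights = [3.0, 4.0, 0.0, 0.0]
alpha = step_size(weights, rho_t=0.01)

# With all-zero weights the eps2 floor keeps the step size positive.
alpha_zero = step_size([0.0, 0.0], rho_t=0.01)
```

Because the step size scales with the magnitude of the weights, large parameters take proportionally larger steps, while the $\epsilon_2$ floor prevents parameters initialized at zero from being stuck.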
3. Algorithms
Adafactor for Weight Vectors
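Inputs:
- Initial point: $X_0 \in \mathbb{R}^n$
- Relative step sizes: $\rho_t$ for $t = 1$ to $T$
- Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{2t} = 0$ at $t = 1$
- Regularization constants: $\epsilon_1, \epsilon_2$
- Clipping threshold: $d$
Algorithm:
- For $t = 1$ to $T$:
- Compute adaptive step size: $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$
- Compute gradient: $G_t = \nabla f(X_{t-1})$
- Update second moment estimate: $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
- Compute normalized gradient: $U_t = G_t / \sqrt{\hat{V}_t}$
- Apply clipping: $\hat{U}_t = U_t / \max(1, \text{RMS}(U_t) / d)$
- Update parameter: $X_t = X_{t-1} - \alpha_t \hat{U}_t$
- End for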
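One pass through the loop body above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function name `adafactor_vector_step`, the list-of-floats representation, and the quadratic test function are all hypothetical, and the default schedules $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$ are the ones proposed by Shazeer and Stern.

```python
import math


def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adafactor_vector_step(x, grad, v_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector stored as a list of floats.

    x: current weights, grad: gradient at x, v_prev: previous second
    moment estimate, t: 1-based step index. Returns (new_x, new_v).
    """
    rho_t = min(1e-2, 1.0 / math.sqrt(t))   # relative step size
    beta2_t = 1.0 - t ** (-0.8)             # decay, equal to 0 at t = 1
    alpha_t = max(eps2, rms(x)) * rho_t     # adaptive step size

    # Exponential moving average of the regularized squared gradient.
    v = [beta2_t * vp + (1.0 - beta2_t) * (g * g + eps1)
         for vp, g in zip(v_prev, grad)]
    # Normalized gradient, then clipping so that RMS(u_hat) <= d.
    u = [g / math.sqrt(vi) for g, vi in zip(grad, v)]
    scale = max(1.0, rms(u) / d)
    u_hat = [ui / scale for ui in u]
    new_x = [xi - alpha_t * ui for xi, ui in zip(x, u_hat)]
    return new_x, v


# Hypothetical first step on f(x) = 0.5 * sum(x_i^2), whose gradient is x.
x0 = [1.0, -2.0, 3.0]
x1, v1 = adafactor_vector_step(x0, grad=x0, v_prev=[0.0] * 3, t=1)
```

Since the decay is zero at $t = 1$, the first second moment estimate is just the squared gradient, so every entry of the normalized gradient has magnitude 1 and each weight moves by exactly $\alpha_1$ toward zero in this example.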
Adafactor for Weight Matrices
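Inputs:
- Initial point: $X_0 \in \mathbb{R}^{n \times m}$
- Relative step sizes: $\rho_t$ for $t = 1$ to $T$
- Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{2t} = 0$ at $t = 1$
- Regularization constants: $\epsilon_1, \epsilon_2$
- Clipping threshold: $d$
Algorithm:
- For $t = 1$ to $T$:
- Compute adaptive step size: $\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t$
- Compute gradient: $G_t = \nabla f(X_{t-1})$
- Update row-wise second moment: $R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^\top) 1_m$
- Update column-wise second moment: $C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^\top (G_t^2 + \epsilon_1 1_n 1_m^\top)$
- Update overall second moment estimate: $\hat{V}_t = R_t C_t / (1_n^\top R_t)$
- Compute normalized gradient: $U_t = G_t / \sqrt{\hat{V}_t}$
- Apply clipping: $\hat{U}_t = U_t / \max(1, \text{RMS}(U_t) / d)$
- Update parameter: $X_t = X_{t-1} - \alpha_t \hat{U}_t$
- End for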
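The factored update above stores only a row vector and a column vector of statistics instead of a full $n \times m$ second moment matrix. A minimal Python sketch of one step is given below. The function name `adafactor_matrix_step`, the list-of-rows matrix representation, and the sample numbers are assumptions made for illustration, and the default schedules follow the values proposed by Shazeer and Stern.

```python
import math


def rms(matrix):
    """Root mean square over all entries of a list-of-rows matrix."""
    flat = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))


def adafactor_matrix_step(x, grad, r_prev, c_prev, t,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update for an n-by-m weight matrix.

    r_prev has length n (row statistics) and c_prev has length m
    (column statistics). Returns (new_x, r, c).
    """
    n, m = len(x), len(x[0])
    rho_t = min(1e-2, 1.0 / math.sqrt(t))   # relative step size
    beta2_t = 1.0 - t ** (-0.8)             # decay, equal to 0 at t = 1
    alpha_t = max(eps2, rms(x)) * rho_t     # adaptive step size

    sq = [[g * g + eps1 for g in row] for row in grad]
    # Moving averages of the row sums and column sums of sq.
    r = [beta2_t * r_prev[i] + (1.0 - beta2_t) * sum(sq[i])
         for i in range(n)]
    c = [beta2_t * c_prev[j]
         + (1.0 - beta2_t) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    # Rank-1 reconstruction: V_hat[i][j] = r[i] * c[j] / sum(r).
    total = sum(r)
    u = [[grad[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms(u) / d)            # clipping factor
    new_x = [[x[i][j] - alpha_t * u[i][j] / scale for j in range(m)]
             for i in range(n)]
    return new_x, r, c


# Hypothetical 2x3 example: n + m = 5 statistics are stored instead of 6.
x0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
g1 = [[0.1, -0.2, 0.3], [-0.4, 0.5, -0.6]]
x1, r1, c1 = adafactor_matrix_step(x0, g1, [0.0] * 2, [0.0] * 3, t=1)
```

The saving is negligible for a 2 by 3 matrix, but for a large layer the stored statistics grow as $n + m$ rather than $n \cdot m$, which is the memory reduction that motivates the factored form.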
4. Proposed Hyperparameters for Adafactor
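- Regularization constant 1: $\epsilon_1 = 10^{-30}$
- Regularization constant 2: $\epsilon_2 = 10^{-3}$
- Clipping threshold: $d = 1$
- Relative step size: $\rho_t = \min(10^{-2}, 1/\sqrt{t})$
- Second moment decay: $\hat{\beta}_{2t} = 1 - t^{-0.8}$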
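The two time-dependent hyperparameters proposed by Shazeer and Stern, $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$, can be written as small Python helpers to see how they evolve. The function names below are assumptions introduced for this sketch.

```python
import math


def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t)) for a 1-based step index t."""
    return min(1e-2, 1.0 / math.sqrt(t))


def second_moment_decay(t):
    """beta2_hat_t = 1 - t^(-0.8): zero at t = 1, approaching 1 as t grows."""
    return 1.0 - t ** (-0.8)


# Print both schedules at a few step counts to see how they evolve.
for t in (1, 10, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t), second_moment_decay(t))
```

The relative step size stays at $10^{-2}$ for the first $10^4$ steps and then decays as $1/\sqrt{t}$, while the decay parameter starts at zero, so no bias correction is needed, and moves toward 1 so that later estimates average over a longer history.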
Numerical Examples
Step-by-step instructions for determining the result of the first iteration.
Problem setup
Initial weights ($X_0$):
Gradient ($G_t$ at $t = 1$):
Hyperparameters setup
$\epsilon_2$ (Minimum learning rate scaling factor)
$\epsilon_1$ (Regularization constant)
$d$ (Clipping threshold)
$\rho_t$ (Relative step size)
$\hat{\beta}_{2t}$ (Second moment decay)
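Step 1: Learning Rate Scaling
Define the relative step size $\rho_t$ for the first iteration ($t = 1$).
Step 1.1: Root Mean Square (RMS) calculation for $X_0$
RMS formula: $\text{RMS}(X_0) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} X_{0,i}^2}$, where the sum runs over all $n$ entries of $X_0$.
Substitute the initial weights into the RMS formula.
Find the learning rate scaling ($\alpha_t$): $\alpha_t = \max(\epsilon_2, \text{RMS}(X_0)) \rho_t$ with $t = 1$.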
Applications
Conclusion
Reference