Adafactor

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function f(x), where x ∈ ℝⁿ and x is the weight vector to be optimized.

2. Parameters

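  • Gradient:
    $G_t = \nabla f(x_{t-1})$
  • Second moment estimate:
    $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
    Where:
    * $\hat{V}_t$ is the running average of the squared gradient (the square $G_t^2$ is taken element-wise).
    * $\hat{\beta}_{2t}$ is the corrected decay parameter.
    * $\epsilon_1$ is a regularization constant.
  • Step size:
    $\alpha_t = \max(\epsilon_2, \mathrm{RMS}(x_{t-1})) \, \rho_t$
    Where:
    * $\rho_t$ is the relative step size.
    * $\epsilon_2$ is a regularization constant.
    * $\mathrm{RMS}$ is the root mean square, defined for a vector $x \in \mathbb{R}^n$ as:
      $\mathrm{RMS}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$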
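
As a minimal sketch of the step-size rule, the snippet below computes the root mean square of a weight vector and scales it by a relative step size, using a floor so that weights initialized at zero still move. The function names `rms` and `step_size`, and the default `eps2=1e-3`, are illustrative assumptions rather than part of the original text.

```python
import math


def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def step_size(x_prev, rho_t, eps2=1e-3):
    """Adaptive step size: max(eps2, RMS(x_prev)) * rho_t."""
    return max(eps2, rms(x_prev)) * rho_t


# Step size scales with the magnitude of the current weights.
alpha = step_size([3.0, 4.0], rho_t=0.01)

# With all-zero weights, eps2 keeps the step size from collapsing to zero.
zero_alpha = step_size([0.0, 0.0], rho_t=0.01)
```

Because the step is proportional to the RMS of the weights, the update is relative: parameters of different scales change by a comparable fraction per step.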

3. Problem Formulation

Adafactor for Weight Vectors

Inputs:

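  • Initial point: $x_0 \in \mathbb{R}^n$
  • Relative step sizes: $\rho_t$ for $t = 1$ to $T$
  • Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{21} = 0$
  • Regularization constants: $\epsilon_1, \epsilon_2$
  • Clipping threshold: $d$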

Algorithm:

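  1. For $t = 1$ to $T$:
    1. Compute adaptive step size:
       $\alpha_t = \max(\epsilon_2, \mathrm{RMS}(x_{t-1})) \, \rho_t$
    2. Compute gradient:
       $G_t = \nabla f(x_{t-1})$
    3. Update second moment estimate:
       $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
    4. Compute normalized gradient:
       $U_t = G_t / \sqrt{\hat{V}_t}$
    5. Apply clipping:
       $\hat{U}_t = U_t / \max(1, \mathrm{RMS}(U_t) / d)$
    6. Update parameter:
       $x_t = x_{t-1} - \alpha_t \hat{U}_t$
  2. End for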
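
The vector procedure can be sketched in plain Python as follows. This is a hedged illustration, not a reference implementation: the function name `adafactor_vector_step`, the toy quadratic objective, and the use of the schedules $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$ inside the step are assumptions made for the example. Each line in the function body corresponds to one step of the loop (step size, gradient, second moment, normalization, clipping, update).

```python
import math


def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adafactor_vector_step(x, v, grad, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector; returns (new_x, new_v)."""
    rho = min(1e-2, 1.0 / math.sqrt(t))   # relative step size (assumed schedule)
    beta2 = 1.0 - t ** (-0.8)             # decay (assumed schedule); 0 at t = 1
    alpha = max(eps2, rms(x)) * rho       # adaptive step size
    g = grad(x)                           # gradient at the current point
    # Running average of the element-wise squared gradient.
    v = [beta2 * vi + (1.0 - beta2) * (gi * gi + eps1) for vi, gi in zip(v, g)]
    u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]  # normalized gradient
    scale = max(1.0, rms(u) / d)          # clipping factor
    x = [xi - alpha * ui / scale for xi, ui in zip(x, u)]
    return x, v


def grad(x):
    """Gradient of the toy objective f(x) = sum(x_i^2) (illustrative only)."""
    return [2.0 * xi for xi in x]


x, v = [1.0, -2.0], [0.0, 0.0]
start_loss = sum(xi * xi for xi in x)
for t in range(1, 201):
    x, v = adafactor_vector_step(x, v, grad, t)
end_loss = sum(xi * xi for xi in x)
```

Since the decay is zero at the first step, the initial value of `v` does not matter and no bias correction is needed.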

Adafactor for Weight Matrices

Inputs:

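  • Initial point: $X_0 \in \mathbb{R}^{n \times m}$
  • Relative step sizes: $\rho_t$ for $t = 1$ to $T$
  • Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{21} = 0$
  • Regularization constants: $\epsilon_1, \epsilon_2$
  • Clipping threshold: $d$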

Algorithm:

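  1. For $t = 1$ to $T$:
    1. Compute adaptive step size:
       $\alpha_t = \max(\epsilon_2, \mathrm{RMS}(X_{t-1})) \, \rho_t$
    2. Compute gradient:
       $G_t = \nabla f(X_{t-1})$
    3. Update row-wise second moment:
       $R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^\top) 1_m$
    4. Update column-wise second moment:
       $C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^\top (G_t^2 + \epsilon_1 1_n 1_m^\top)$
    5. Update overall second moment estimate:
       $\hat{V}_t = R_t C_t / (1_n^\top R_t)$
    6. Compute normalized gradient:
       $U_t = G_t / \sqrt{\hat{V}_t}$
    7. Apply clipping:
       $\hat{U}_t = U_t / \max(1, \mathrm{RMS}(U_t) / d)$
    8. Update parameter:
       $X_t = X_{t-1} - \alpha_t \hat{U}_t$
  2. End for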
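
The matrix procedure differs from the vector one only in how the second moment is stored: instead of a full $n \times m$ accumulator, it keeps $n$ row statistics and $m$ column statistics and rebuilds a rank-1 estimate from them. The sketch below illustrates this with nested lists; the function name `adafactor_matrix_step`, the example matrices, and the built-in schedules for the relative step size and decay are assumptions made for illustration.

```python
import math


def rms(matrix):
    """Root mean square over all entries of a list-of-lists matrix."""
    flat = [a for row in matrix for a in row]
    return math.sqrt(sum(a * a for a in flat) / len(flat))


def adafactor_matrix_step(X, R, C, G, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update for an n x m matrix X with gradient G.

    R holds n row statistics and C holds m column statistics, so the
    optimizer state costs n + m numbers instead of n * m.
    """
    n, m = len(X), len(X[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))   # relative step size (assumed schedule)
    beta2 = 1.0 - t ** (-0.8)             # decay (assumed schedule); 0 at t = 1
    alpha = max(eps2, rms(X)) * rho       # adaptive step size
    sq = [[g * g + eps1 for g in row] for row in G]
    row_sums = [sum(row) for row in sq]
    col_sums = [sum(sq[i][j] for i in range(n)) for j in range(m)]
    R = [beta2 * r + (1.0 - beta2) * s for r, s in zip(R, row_sums)]
    C = [beta2 * c + (1.0 - beta2) * s for c, s in zip(C, col_sums)]
    total = sum(R)
    # Rank-1 second moment estimate: V[i][j] = R[i] * C[j] / sum(R).
    U = [[G[i][j] / math.sqrt(R[i] * C[j] / total) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms(U) / d)          # clipping factor
    X = [[X[i][j] - alpha * U[i][j] / scale for j in range(m)]
         for i in range(n)]
    return X, R, C


X0 = [[1.0, 2.0], [3.0, 4.0]]
G0 = [[1.0, 2.0], [2.0, 4.0]]            # squared entries form a rank-1 matrix
X1, R1, C1 = adafactor_matrix_step(X0, [0.0, 0.0], [0.0, 0.0], G0, t=1)
```

In this example the squared gradient is exactly rank-1, so the factored estimate reproduces it and every normalized entry has magnitude one; for general gradients the reconstruction is only an approximation.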

4. Proposed Hyperparameters for Adafactor

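  • Regularization constant 1: $\epsilon_1 = 10^{-30}$
  • Regularization constant 2: $\epsilon_2 = 10^{-3}$
  • Clipping threshold: $d = 1$
  • Relative step size: $\rho_t = \min(10^{-2}, 1/\sqrt{t})$
  • Second moment decay: $\hat{\beta}_{2t} = 1 - t^{-0.8}$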
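
To see how the two time-dependent hyperparameters behave, the sketch below tabulates them at a few step counts, assuming the schedules $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$. The function names `relative_step_size` and `second_moment_decay` are hypothetical and chosen only for this example.

```python
import math


def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t)): constant early, then inverse-sqrt decay."""
    return min(1e-2, 1.0 / math.sqrt(t))


def second_moment_decay(t):
    """beta2_t = 1 - t^(-0.8): 0 at t = 1, approaching 1 as t grows."""
    return 1.0 - t ** (-0.8)


# Tabulate both schedules at a few step counts.
for t in (1, 10, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t), round(second_moment_decay(t), 4))
```

The relative step size stays at its cap for the first 10,000 steps and only then begins to decay, while the second moment decay starts at zero, so the first update uses the current squared gradient alone.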

Numerical Examples

Applications

Conclusion

Reference