Adafactor

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function <math>f(x)</math>, where <math>f: \mathbb{R}^n \to \mathbb{R}</math> and <math>x \in \mathbb{R}^n</math> is the weight vector to be optimized.

2. Parameters

  • Gradient:

<math>G_t = \nabla f(x_{t-1})</math>

  • Second moment estimate:

<math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>

  • Where:
    • <math>\hat{V}_t</math> is the running average of the squared gradient.
    • <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
    • <math>\epsilon_1</math> is a regularization constant.

  • Step size:

<math>\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t</math>

  • Where:
    • <math>\rho_t</math> is the relative step size.
    • <math>\epsilon_2</math> is a regularization constant.
    • <math>\text{RMS}</math> is the root mean square, defined as:

<math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>

<math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
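
To make these definitions concrete, the following is a minimal NumPy sketch of one update of the quantities above for a small weight vector. The numeric values, the decay <math>\hat{\beta}_{2t} = 0.9</math>, and the relative step size <math>\rho_t = 10^{-2}</math> are illustrative choices for this example only, not values prescribed by the algorithm.

<syntaxhighlight lang="python">
import numpy as np

def rms(u):
    # Root mean square of the entries of u, per the definition above.
    return float(np.sqrt(np.mean(np.square(u))))

# Illustrative values for a 3-component weight vector (made up for this example).
x_prev = np.array([0.5, -1.0, 2.0])     # x_{t-1}
g      = np.array([0.1, -0.3, 0.2])     # G_t = grad f(x_{t-1})
v_prev = np.array([0.02, 0.05, 0.01])   # V_hat_{t-1}

eps1, eps2 = 1e-30, 1e-3                # regularization constants
beta2_t = 0.9                           # corrected decay parameter (illustrative)
rho_t = 1e-2                            # relative step size (illustrative)

v_t = beta2_t * v_prev + (1 - beta2_t) * (g ** 2 + eps1)   # second moment estimate
alpha_t = max(eps2, rms(x_prev)) * rho_t                   # adaptive step size
print(v_t, alpha_t)
</syntaxhighlight>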

3. Problem Formulation

Adafactor for Weighted Vectors

Inputs:

  • Initial point: <math>x_0 \in \mathbb{R}^n</math>
  • Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
  • Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
  • Regularization constants: <math>\epsilon_1, \epsilon_2</math>
  • Clipping threshold: <math>d</math>

Algorithm (a NumPy sketch of this loop is given below the list):

  1. For <math>t = 1</math> to <math>T</math>:
    1. Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t</math>
    2. Compute gradient: <math>G_t = \nabla f(x_{t-1})</math>
    3. Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
    4. Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
    5. Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max\left(1, \text{RMS}(U_t)/d\right)}</math>
    6. Update parameter: <math>x_t = x_{t-1} - \alpha_t \hat{U}_t</math>
  2. End for
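
The following is a minimal, self-contained NumPy sketch of this vector-case loop. It is not an implementation from a particular library; the step-size and decay schedules follow the proposed hyperparameters listed later in this article, and the quadratic test problem at the end is made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

def rms(u):
    # Root mean square of the entries of u.
    return float(np.sqrt(np.mean(np.square(u))))

def adafactor_vector(grad_fn, x0, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight vector, following the steps above."""
    x = np.asarray(x0, dtype=float).copy()
    v_hat = np.zeros_like(x)                       # second moment estimate, V_hat_0 = 0
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))        # relative step size
        beta2_t = 1.0 - t ** (-0.8)                # corrected decay (equals 0 at t = 1)
        alpha_t = max(eps2, rms(x)) * rho_t        # adaptive step size
        g = grad_fn(x)                             # gradient
        v_hat = beta2_t * v_hat + (1 - beta2_t) * (g ** 2 + eps1)  # second moment update
        u = g / np.sqrt(v_hat)                     # normalized gradient
        u_hat = u / max(1.0, rms(u) / d)           # update clipping
        x = x - alpha_t * u_hat                    # parameter update
    return x

# Toy usage: minimize 0.5 * ||x - c||^2, whose gradient is x - c.
c = np.array([1.0, -2.0, 3.0])
x_final = adafactor_vector(lambda x: x - c, np.zeros(3), T=2000)
print(x_final)  # close to c; the relative step sizes make early progress gradual
</syntaxhighlight>

Because the step size scales with <math>\text{RMS}(x_{t-1})</math>, the relative step size <math>\rho_t</math> controls the magnitude of the update relative to the current size of the parameters.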

Adafactor for Weighted Matrices

Inputs:

  • Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
  • Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
  • Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
  • Regularization constants: <math>\epsilon_1, \epsilon_2</math>
  • Clipping threshold: <math>d</math>

Algorithm (a sketch of the factored second-moment update follows the list):

  1. For <math>t = 1</math> to <math>T</math>:
    1. Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
    2. Compute gradient: <math>G_t = \nabla f(X_{t-1})</math>
    3. Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
    4. Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
    5. Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
    6. Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
    7. Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max\left(1, \text{RMS}(U_t)/d\right)}</math>
    8. Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
  2. End for
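
Here is a minimal NumPy sketch of the factored bookkeeping in steps 3 to 5: only the row accumulator <math>R_t</math> and the column accumulator <math>C_t</math> are stored, and the full second-moment estimate <math>\hat{V}_t</math> is reconstructed as a rank-1 outer product. The random gradient and the matrix sizes are illustrative only.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                    # illustrative matrix shape
eps1 = 1e-30
R = np.zeros(n)                # row-wise second moment accumulator R_t
C = np.zeros(m)                # column-wise second moment accumulator C_t

for t in range(1, 6):
    G = rng.normal(size=(n, m))              # stand-in for the gradient G_t
    beta2_t = 1.0 - t ** (-0.8)              # corrected decay (equals 0 at t = 1)
    sq = G ** 2 + eps1                       # elementwise squared gradient
    R = beta2_t * R + (1 - beta2_t) * sq.sum(axis=1)   # row sums
    C = beta2_t * C + (1 - beta2_t) * sq.sum(axis=0)   # column sums
    V_hat = np.outer(R, C) / R.sum()         # rank-1 reconstruction of V_hat_t

# Only n + m numbers are stored per weight matrix instead of n * m, yet the
# reconstruction preserves the row and column sums of the full estimate.
print(V_hat.shape)   # (4, 3)
</syntaxhighlight>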

4. Proposed Hyperparameters for Adafactor

  • Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
  • Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
  • Clipping threshold: <math>d = 1</math>
  • Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
  • Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
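
Written as plain Python, these proposed defaults and schedules look as follows; the constant and function names are illustrative, not part of any library API.

<syntaxhighlight lang="python">
import numpy as np

EPS_1 = 1e-30        # regularization constant 1
EPS_2 = 1e-3         # regularization constant 2
CLIP_D = 1.0         # clipping threshold d

def relative_step_size(t):
    # rho_t = min(1e-2, 1/sqrt(t)); starts decaying once t exceeds 10,000 steps.
    return min(1e-2, 1.0 / np.sqrt(t))

def second_moment_decay(t):
    # beta_hat_2t = 1 - t^(-0.8); starts at 0 and approaches 1 as t grows.
    return 1.0 - t ** (-0.8)

print(relative_step_size(1), second_moment_decay(1))        # 0.01 0.0
print(relative_step_size(1_000_000), second_moment_decay(100))
</syntaxhighlight>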

Numerical Examples

Applications

Conclusion

Reference