Adafactor

From Cornell University Computational Optimization Open Textbook - Optimization Wiki



Revision as of 17:00, 10 December 2024

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function <math>f(X)</math>, where <math>X \in \mathbb{R}^n</math> is the weight vector to be optimized.

2. Parameters

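  • Gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
  • Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
  • Where:
    • <math>\hat{V}_t</math> is the running average of the squared gradient.
    • <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
    • <math>\epsilon_1</math> is a regularization constant.
  • Step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
  • Where:
    • <math>\rho_t</math> is the relative step size.
    • <math>\epsilon_2</math> is a regularization constant.
    • <math>\text{RMS}</math> is the root mean square, defined as: <math>\text{RMS}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}</math>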
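
The step size rule can be illustrated with a short sketch. It assumes the rule <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math> and the usual root-mean-square definition; the function names `rms` and `step_size` and the default `eps2=1e-3` are illustrative choices, not part of the article.

```python
import math


def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def step_size(params, rho_t, eps2=1e-3):
    """Scale the relative step size rho_t by the parameter magnitude.

    eps2 (assumed default) keeps the step from collapsing to zero
    when the parameters themselves are zero.
    """
    return max(eps2, rms(params)) * rho_t


# Parameters of ordinary magnitude: the step is proportional to RMS(params).
alpha_regular = step_size([3.0, -4.0], rho_t=0.01)

# All-zero parameters: eps2 acts as a lower bound on the scale.
alpha_zero_init = step_size([0.0, 0.0], rho_t=0.01)
```

Because the step is proportional to the parameter scale, the same relative step size can serve weights of very different magnitudes.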

3. Algorithms

Adafactor for Weight Vectors

Inputs:

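  • Initial point: <math>X_0 \in \mathbb{R}^n</math>
  • Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
  • Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
  • Regularization constants: <math>\epsilon_1, \epsilon_2</math>
  • Clipping threshold: <math>d</math>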

Algorithm:

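  • For <math>t = 1</math> to <math>T</math>:
    • Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
    • Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
    • Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
    • Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
    • Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
    • Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
  • End for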
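
The loop above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation. It assumes the elementwise second moment update <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>, the schedules <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> and <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>, and the defaults shown in the signature. The names `adafactor_vector` and `grad_fn` are hypothetical.

```python
import math


def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adafactor_vector(grad_fn, x0, num_steps, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run num_steps Adafactor updates on a weight vector (list of floats).

    grad_fn maps the current weights to their gradient (same length).
    """
    x = list(x0)
    v = [0.0] * len(x)  # running second moment estimate
    for t in range(1, num_steps + 1):
        rho_t = min(1e-2, 1.0 / math.sqrt(t))  # assumed relative step size
        beta2_t = 1.0 - t ** -0.8              # assumed decay; equals 0 at t = 1
        alpha_t = max(eps2, rms(x)) * rho_t    # adaptive step size
        g = grad_fn(x)                         # gradient at the previous iterate
        # Exponential moving average of the squared gradient.
        v = [beta2_t * vi + (1.0 - beta2_t) * (gi * gi + eps1)
             for vi, gi in zip(v, g)]
        # Normalized gradient.
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
        # Clip so that the RMS of the applied update is at most d.
        scale = max(1.0, rms(u) / d)
        x = [xi - alpha_t * ui / scale for xi, ui in zip(x, u)]
    return x


# Usage: minimize f(x) = 0.5 * sum(x_i^2), whose gradient is x itself.
x_start = [1.0, -2.0]
x_final = adafactor_vector(lambda x: list(x), x_start, num_steps=300)
```

Since <math>\hat{\beta}_{21} = 0</math>, the first update overwrites the zero-initialized second moment entirely, so no separate bias correction is applied in this sketch.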

Adafactor for Weight Matrices

Inputs:

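  • Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
  • Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
  • Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
  • Regularization constants: <math>\epsilon_1, \epsilon_2</math>
  • Clipping threshold: <math>d</math>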

Algorithm:

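  • For <math>t = 1</math> to <math>T</math>:
    • Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
    • Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
    • Update row-wise second moment: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
    • Update column-wise second moment: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
    • Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
    • Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
    • Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
    • Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
  • End for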
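
A single iteration of the matrix variant can be sketched as follows. This is an illustration under stated assumptions: matrices are plain lists of rows, the row and column statistics are the row sums and column sums of <math>G_t^2 + \epsilon_1</math>, and the overall estimate is their outer product divided by the sum of the row statistics. The function name `adafactor_matrix_step` and its argument layout are hypothetical; `rho_t` and `beta2_t` are passed in by the caller.

```python
import math


def rms(matrix):
    """Root mean square over all entries of a list-of-rows matrix."""
    flat = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))


def adafactor_matrix_step(x, g, r, c, rho_t, beta2_t,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for an n-by-m weight matrix.

    x, g : weights and gradient (lists of n rows with m entries each)
    r, c : row statistics (length n) and column statistics (length m)
    Returns the updated (x, r, c).
    """
    n, m = len(x), len(x[0])
    alpha_t = max(eps2, rms(x)) * rho_t
    # Squared gradient with eps1 added to every entry.
    sq = [[g[i][j] ** 2 + eps1 for j in range(m)] for i in range(n)]
    # Moving averages of the row sums and column sums of sq.
    r = [beta2_t * r[i] + (1.0 - beta2_t) * sum(sq[i]) for i in range(n)]
    c = [beta2_t * c[j] + (1.0 - beta2_t) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    total = sum(r)
    # Rank-one second moment estimate v[i][j] = r[i] * c[j] / total,
    # used immediately to normalize the gradient.
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    # Clip so that the RMS of the applied update is at most d.
    scale = max(1.0, rms(u) / d)
    x_new = [[x[i][j] - alpha_t * u[i][j] / scale for j in range(m)]
             for i in range(n)]
    return x_new, r, c


# Usage: first step (beta2_t = 0) on a gradient whose square has rank one,
# so the factored estimate reproduces the squared gradient.
x0 = [[1.0, 1.0], [1.0, 1.0]]
g1 = [[1.0, 2.0], [2.0, 4.0]]
x1, r1, c1 = adafactor_matrix_step(x0, g1, [0.0, 0.0], [0.0, 0.0],
                                   rho_t=0.01, beta2_t=0.0)
```

Only the `n + m` entries of `r` and `c` persist between iterations. The full `n`-by-`m` second moment estimate is rebuilt on the fly, which is where the memory saving of the factored form comes from.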

4. Proposed Hyperparameters for Adafactor

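  • Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
  • Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
  • Clipping threshold: <math>d = 1</math>
  • Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
  • Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>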
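
The two schedules in this list depend on the iteration count, and a short sketch makes their behavior concrete. It assumes the forms <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> and <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>; the function names are illustrative.

```python
import math


def relative_step_size(t):
    """rho_t: constant at 1e-2 early on, then decaying as 1/sqrt(t)."""
    return min(1e-2, 1.0 / math.sqrt(t))


def second_moment_decay(t):
    """beta2_t: starts at 0 for t = 1 and approaches 1 as t grows."""
    return 1.0 - t ** -0.8


# Tabulate both schedules at a few iteration counts.
schedule = {t: (relative_step_size(t), second_moment_decay(t))
            for t in (1, 100, 10_000, 1_000_000)}
for t, (rho_t, beta2_t) in schedule.items():
    print(f"t={t:>9}  rho_t={rho_t:.4g}  beta2_t={beta2_t:.6f}")
```

Under these assumed forms the relative step size stays at <math>10^{-2}</math> until <math>t = 10^4</math> and shrinks afterwards, while the decay rate rises toward 1 so that later second moment estimates average over a longer history.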

Numerical Examples

Applications

Conclusion

Reference