Adafactor
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function f(x), where x ∈ ℝⁿ is the weight vector to be optimized.
2. Parameters
- Gradient:
  G_t = \nabla f(x_{t-1})
- Second moment estimate:
  \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)
  - \hat{V}_t is the running average of the squared gradient.
  - \hat{\beta}_{2t} is the corrected decay parameter.
  - \epsilon_1 is a regularization constant.
- Step size:
  \alpha_t = \max(\epsilon_2, \mathrm{RMS}(x_{t-1})) \rho_t
  - \rho_t is the relative step size.
  - \epsilon_2 is a regularization constant.
- RMS is the root mean square of the unscaled update u_{xt} = -g_{xt} / \sqrt{\hat{v}_{xt}}, taken over the parameter set X (a small sketch follows this list):
  \mathrm{RMS}(U_t) = \mathrm{RMS}_{x \in X}(u_{xt}) = \sqrt{\mathrm{Mean}_{x \in X}\left( g_{xt}^2 / \hat{v}_{xt} \right)}
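To make the RMS quantity concrete, the following minimal numpy sketch checks the identity above on toy values (the helper name rms and the numbers are illustrative assumptions, not from the source):

  import numpy as np

  def rms(x):
      # Root mean square over all entries of an array.
      return np.sqrt(np.mean(np.square(x)))

  # Toy check of the identity above: with u_xt = -g_xt / sqrt(v_hat_xt),
  # RMS(U_t) equals sqrt(mean(g_xt^2 / v_hat_xt)).
  g = np.array([0.1, -0.4, 0.25])        # illustrative gradient values
  v_hat = np.array([0.04, 0.16, 0.09])   # illustrative second-moment estimates
  u = -g / np.sqrt(v_hat)                # unscaled per-parameter update
  print(rms(u), np.sqrt(np.mean(g**2 / v_hat)))  # both print the same value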
3. Problem Formulation
Adafactor for Weighted Vectors
Inputs:
- Initial point: X_0 \in \mathbb{R}^n
- Relative step sizes: \rho_t for t = 1 to T
- Second moment decay: \hat{\beta}_{2t} for t = 1 to T, with \hat{\beta}_{21} = 0
- Regularization constants: \epsilon_1, \epsilon_2
- Clipping threshold: d
Algorithm:
- For t = 1 to T:
  - Compute adaptive step size:
    \alpha_t = \max(\epsilon_2, \mathrm{RMS}(X_{t-1})) \rho_t
  - Compute gradient:
    G_t = \nabla f_t(X_{t-1})
  - Update second moment estimate:
    \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)
  - Compute normalized gradient:
    U_t = G_t / \sqrt{\hat{V}_t}
  - Apply clipping:
    \hat{U}_t = U_t / \max(1, \mathrm{RMS}(U_t)/d)
  - Update parameter:
    X_t = X_{t-1} - \alpha_t \hat{U}_t
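As a worked illustration, here is a short numpy sketch of the vector algorithm above, with the step-size and decay schedules taken from the proposed hyperparameters in Section 4 (the function name adafactor_vector and the quadratic test problem are illustrative assumptions, not from the source):

  import numpy as np

  def rms(x):
      # Root mean square over all entries of an array.
      return np.sqrt(np.mean(np.square(x)))

  def adafactor_vector(grad_fn, x0, T=500, eps1=1e-30, eps2=1e-3, d=1.0):
      # Follows the listing above; rho_t and beta_hat_2t use the schedules
      # proposed in Section 4.
      x = np.asarray(x0, dtype=float)
      v_hat = np.zeros_like(x)                 # second-moment accumulator V_hat_t
      for t in range(1, T + 1):
          rho = min(1e-2, 1.0 / np.sqrt(t))    # relative step size rho_t
          beta2 = 1.0 - t ** (-0.8)            # decay beta_hat_2t (zero at t = 1)
          alpha = max(eps2, rms(x)) * rho      # adaptive step size alpha_t
          g = grad_fn(x)                       # gradient G_t
          v_hat = beta2 * v_hat + (1.0 - beta2) * (g**2 + eps1)
          u = g / np.sqrt(v_hat)               # normalized gradient U_t
          u_hat = u / max(1.0, rms(u) / d)     # clipped update U_hat_t
          x = x - alpha * u_hat                # parameter update X_t
      return x

  # Usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x itself.
  print(adafactor_vector(lambda x: x, [5.0, -3.0, 1.0]))  # tends toward zero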
Adafactor for Weighted Matrices
Inputs:
- Initial point: X_0 \in \mathbb{R}^{n \times m}
- Relative step sizes: \rho_t for t = 1 to T
- Second moment decay: \hat{\beta}_{2t} for t = 1 to T, with \hat{\beta}_{21} = 0
- Regularization constants: \epsilon_1, \epsilon_2
- Clipping threshold: d
Algorithm:
- For t = 1 to T:
  - Compute adaptive step size:
    \alpha_t = \max(\epsilon_2, \mathrm{RMS}(X_{t-1})) \rho_t
  - Compute gradient:
    G_t = \nabla f_t(X_{t-1})
  - Update row-wise second moment:
    R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^\top) 1_m
  - Update column-wise second moment:
    C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^\top (G_t^2 + \epsilon_1 1_n 1_m^\top)
  - Update overall second moment estimate:
    \hat{V}_t = R_t C_t / (1_n^\top R_t)
  - Compute normalized gradient:
    U_t = G_t / \sqrt{\hat{V}_t}
  - Apply clipping:
    \hat{U}_t = U_t / \max(1, \mathrm{RMS}(U_t)/d)
  - Update parameter:
    X_t = X_{t-1} - \alpha_t \hat{U}_t
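The factored statistics are the point of this variant: for an n × m weight matrix, only the n row sums and m column sums are stored, and the full-size second-moment estimate is rebuilt as a rank-1 outer product. A minimal numpy sketch of one factored update (the function name and the toy gradient are illustrative assumptions):

  import numpy as np

  def factored_second_moment(G, R_prev, C_prev, beta2, eps1=1e-30):
      # One step of the row/column factored statistics from the listing above.
      sq = G**2 + eps1                                     # regularized squared gradient
      R = beta2 * R_prev + (1.0 - beta2) * sq.sum(axis=1)  # row sums R_t, shape (n,)
      C = beta2 * C_prev + (1.0 - beta2) * sq.sum(axis=0)  # column sums C_t, shape (m,)
      v_hat = np.outer(R, C) / R.sum()                     # V_hat_t = R_t C_t / (1_n^T R_t)
      return R, C, v_hat

  G = np.array([[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]])  # toy 3 x 2 gradient
  R, C, v_hat = factored_second_moment(G, np.zeros(3), np.zeros(2), beta2=0.0)
  print(v_hat.shape)  # (3, 2): full-size estimate rebuilt from n + m stored numbers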
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: \epsilon_1 = 10^{-30}
- Regularization constant 2: \epsilon_2 = 10^{-3}
- Clipping threshold: d = 1
- Relative step size: \rho_t = \min(10^{-2}, 1/\sqrt{t})
- Second moment decay: \hat{\beta}_{2t} = 1 - t^{-0.8}
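For reference, these defaults can be collected into a single helper (the function name adafactor_hyperparams is an illustrative assumption):

  import numpy as np

  def adafactor_hyperparams(t):
      # The proposed defaults, with the two schedules as functions of step t >= 1.
      return {
          "eps1": 1e-30,                       # regularization constant 1
          "eps2": 1e-3,                        # regularization constant 2
          "d": 1.0,                            # clipping threshold
          "rho": min(1e-2, 1.0 / np.sqrt(t)),  # relative step size rho_t
          "beta2": 1.0 - t ** (-0.8),          # second moment decay beta_hat_2t
      }

  print(adafactor_hyperparams(1)["beta2"])  # 0.0, matching beta_hat_21 = 0 above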