Adafactor: Difference between revisions


Revision as of 16:30, 10 December 2024

Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function f(x), where x ∈ ℝⁿ is the weight vector to be optimized.

2. Parameters

Gradient:

  $G_{t}=\nabla f(x_{t-1})$

Second moment estimate:

  ${\hat {V}}_{t}={\hat {\beta }}_{2t}{\hat {V}}_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n})$ 
  Where:
   *  ${\hat {V}}_{t}$  is the running average of the squared gradient.
   *  ${\hat {\beta }}_{2t}$  is the corrected decay parameter.
   *  $\epsilon _{1}$  is a regularization constant.

Step size:

  $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(x_{t-1}))\rho _{t}$ 
  Where:
   *  $\rho _{t}$  is the relative step size.
   *  $\epsilon _{2}$  is a regularization constant.
   *  ${\text{RMS}}$  is the root mean square over all entries. For the normalized gradient $U_{t}$, with entries $u_{xt}$ indexed by $x\in X$, it is defined as:
      $u_{xt}={\frac {g_{xt}}{\sqrt {{\hat {v}}_{xt}}}}$ 
      ${\text{RMS}}(U_{t})={\text{RMS}}_{x\in X}(u_{xt})={\sqrt {{\text{Mean}}_{x\in X}\left({\frac {(g_{xt})^{2}}{{\hat {v}}_{xt}}}\right)}}$
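
The step-size rule above can be illustrated with a minimal sketch. The function names `rms` and `step_size` are hypothetical and chosen for illustration only; plain Python lists are assumed so that no third-party library is needed.

```python
import math

def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def step_size(x_prev, rho_t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(x_{t-1})) * rho_t."""
    return max(eps2, rms(x_prev)) * rho_t

# Step size scales with the magnitude of the current weights.
alpha_large = step_size([3.0, -4.0], rho_t=0.01)

# eps2 acts as a floor so all-zero weights can still move.
alpha_floor = step_size([0.0, 0.0], rho_t=0.01)
```

Under these assumptions the step size is proportional to the scale of the weights, and $\epsilon _{2}$ only matters when the weights are very close to zero.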

3. Problem Formulation

Adafactor for Weight Vectors

Inputs:

Initial point: $X_{0}\in \mathbb {R} ^{n}$
Relative step sizes: $\rho _{t}$ for $t=1$ to $T$
Second moment decay: ${\hat {\beta }}_{2t}$ for $t=1$ to $T$ , with ${\hat {\beta }}_{21}=0$
Regularization constants: $\epsilon _{1},\epsilon _{2}$
Clipping threshold: $d$

Algorithm:

For $t=1$ to $T$:
1. Compute adaptive step size:

   $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$

2. Compute gradient:

   $G_{t}=\nabla f_{t}(X_{t-1})$

3. Update second moment estimate:

   ${\hat {V}}_{t}={\hat {\beta }}_{2t}{\hat {V}}_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n})$

4. Compute normalized gradient:

   $U_{t}={\frac {G_{t}}{\sqrt {{\hat {V}}_{t}}}}$

5. Apply clipping:

   ${\hat {U}}_{t}={\frac {U_{t}}{\max(1,{\text{RMS}}(U_{t})/d)}}$

6. Update parameter:

   $X_{t}=X_{t-1}-\alpha _{t}{\hat {U}}_{t}$

End for
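
The vector algorithm above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the names `adafactor_vector`, `grad_fn`, `rho`, and `beta2` are hypothetical, the schedules are passed in as callables to mirror the algorithm's inputs, and the quadratic test objective is an assumption made only to have something to minimize.

```python
import math

def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def adafactor_vector(grad_fn, x0, num_steps, rho, beta2,
                     eps1=1e-30, eps2=1e-3, d=1.0):
    """Run num_steps Adafactor updates on a weight vector (list of floats)."""
    x = list(x0)
    v = [0.0] * len(x)  # second moment estimate; beta2(1) = 0 overwrites it
    for t in range(1, num_steps + 1):
        rho_t, beta2_t = rho(t), beta2(t)
        # 1. Adaptive step size, relative to the scale of the weights.
        alpha_t = max(eps2, rms(x)) * rho_t
        # 2. Gradient at the current point.
        g = grad_fn(x)
        # 3. Running average of the squared gradient (element-wise).
        v = [beta2_t * vi + (1.0 - beta2_t) * (gi * gi + eps1)
             for vi, gi in zip(v, g)]
        # 4. Normalized gradient.
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
        # 5. Clip so that RMS of the update is at most d.
        scale = max(1.0, rms(u) / d)
        # 6. Parameter update.
        x = [xi - alpha_t * ui / scale for xi, ui in zip(x, u)]
    return x

# Assumed toy objective: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
grad = lambda x: list(x)
rho = lambda t: min(1e-2, 1.0 / math.sqrt(t))
beta2 = lambda t: 1.0 - t ** (-0.8)

x_one = adafactor_vector(grad, [1.0, -2.0], 1, rho, beta2)
x_final = adafactor_vector(grad, [1.0, -2.0], 200, rho, beta2)
```

On the first step the decay is zero, so each normalized gradient entry has magnitude close to one and every weight moves by roughly $\alpha _{1}$ toward the minimum.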

Adafactor for Weight Matrices

Inputs:

Initial point: $X_{0}\in \mathbb {R} ^{n\times m}$
Relative step sizes: $\rho _{t}$ for $t=1$ to $T$
Second moment decay: ${\hat {\beta }}_{2t}$ for $t=1$ to $T$ , with ${\hat {\beta }}_{21}=0$
Regularization constants: $\epsilon _{1},\epsilon _{2}$
Clipping threshold: $d$

Algorithm:

For $t=1$ to $T$:
1. Compute adaptive step size:

   $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$

2. Compute gradient:

   $G_{t}=\nabla f_{t}(X_{t-1})$

3. Update row-wise second moment:

   $R_{t}={\hat {\beta }}_{2t}R_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n}1_{m}^{T})1_{m}$

4. Update column-wise second moment:

   $C_{t}={\hat {\beta }}_{2t}C_{t-1}+(1-{\hat {\beta }}_{2t})1_{n}^{T}(G_{t}^{2}+\epsilon _{1}1_{n}1_{m}^{T})$

5. Update overall second moment estimate:

   ${\hat {V}}_{t}={\frac {R_{t}C_{t}}{1_{n}^{T}R_{t}}}$

6. Compute normalized gradient:

   $U_{t}={\frac {G_{t}}{\sqrt {{\hat {V}}_{t}}}}$

7. Apply clipping:

   ${\hat {U}}_{t}={\frac {U_{t}}{\max(1,{\text{RMS}}(U_{t})/d)}}$

8. Update parameter:

   $X_{t}=X_{t-1}-\alpha _{t}{\hat {U}}_{t}$

End for
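
A single iteration of the matrix algorithm above might look like the sketch below. The names `rms_matrix` and `adafactor_matrix_step` are hypothetical, matrices are assumed to be nested Python lists, and the example gradient is chosen so that its element-wise square is exactly rank one, which makes the factored estimate easy to check by hand.

```python
import math

def rms_matrix(a):
    """Root mean square over all entries of a nested-list matrix."""
    entries = [v for row in a for v in row]
    return math.sqrt(sum(v * v for v in entries) / len(entries))

def adafactor_matrix_step(x, g, r_prev, c_prev, rho_t, beta2_t,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update; returns (x_new, r, c)."""
    n, m = len(x), len(x[0])
    alpha_t = max(eps2, rms_matrix(x)) * rho_t

    # Element-wise squared gradient plus eps1, then row and column sums.
    sq = [[g[i][j] ** 2 + eps1 for j in range(m)] for i in range(n)]
    row_sums = [sum(row) for row in sq]                             # length n
    col_sums = [sum(sq[i][j] for i in range(n)) for j in range(m)]  # length m

    # Only these n + m running averages are stored between steps.
    r = [beta2_t * rp + (1.0 - beta2_t) * rs for rp, rs in zip(r_prev, row_sums)]
    c = [beta2_t * cp + (1.0 - beta2_t) * cs for cp, cs in zip(c_prev, col_sums)]

    # Entry (i, j) of the second moment estimate is r[i] * c[j] / sum(r).
    total = sum(r)
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]

    # Clip, then apply the update.
    scale = max(1.0, rms_matrix(u) / d)
    x_new = [[x[i][j] - alpha_t * u[i][j] / scale for j in range(m)]
             for i in range(n)]
    return x_new, r, c

x0 = [[1.0, 2.0], [2.0, 4.0]]
g0 = [[1.0, 2.0], [2.0, 4.0]]  # squares form the rank-one matrix [[1, 4], [4, 16]]
x1, r1, c1 = adafactor_matrix_step(x0, g0, [0.0, 0.0], [0.0, 0.0],
                                   rho_t=0.01, beta2_t=0.0)
```

The point of the factorization is that the state carried between iterations is the pair of vectors $R_{t}$ and $C_{t}$, with $n+m$ entries, rather than a full $n\times m$ second moment matrix. The sketch computes each entry of ${\hat {V}}_{t}$ on the fly for that reason.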

4. Proposed Hyperparameters for Adafactor

Regularization constant 1: $\epsilon _{1}=10^{-30}$
Regularization constant 2: $\epsilon _{2}=10^{-3}$
Clipping threshold: $d=1$
Relative step size: $\rho _{t}=\min(10^{-2},1/{\sqrt {t}})$
Second moment decay: ${\hat {\beta }}_{2t}=1-t^{-0.8}$
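
The two schedules listed above can be written as small helper functions. The names `relative_step_size` and `second_moment_decay` are hypothetical; the formulas are the ones given in the list.

```python
import math

def relative_step_size(t):
    """rho_t = min(1e-2, 1 / sqrt(t)) for step t >= 1."""
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t):
    """beta2_t = 1 - t^(-0.8) for step t >= 1."""
    return 1.0 - t ** (-0.8)

# Tabulate both schedules at a few representative steps.
schedule = [(t, relative_step_size(t), second_moment_decay(t))
            for t in (1, 10, 10_000, 1_000_000)]
```

With these formulas the relative step size stays at $10^{-2}$ for the first $10^{4}$ steps and decays as $1/{\sqrt {t}}$ afterwards, while the decay starts at exactly 0 for $t=1$ (matching ${\hat {\beta }}_{21}=0$ in the inputs) and approaches 1 as $t$ grows.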

