Adafactor

Authors: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function $f(x)$ , where $x\in \mathbb {R} ^{n}$ is the weight vector to be optimized.

2. Parameters

Gradient:

$G_{t}=\nabla f(x_{t-1})$

Second moment estimate:

${\hat {V}}_{t}={\hat {\beta }}_{2t}{\hat {V}}_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n})$

Where:
- ${\hat {V}}_{t}$ is the running average of the squared gradient.
- ${\hat {\beta }}_{2t}$ is the corrected decay parameter.
- $\epsilon _{1}$ is a regularization constant.
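
As a short worked illustration of this recursion, assume the decay schedule ${\hat {\beta }}_{2t}=1-t^{-0.8}$ proposed later in this article. The first two updates then unroll as follows (numeric coefficients are rounded to three decimals):

${\hat {V}}_{1}=0\cdot {\hat {V}}_{0}+(1-0)(G_{1}^{2}+\epsilon _{1}1_{n})=G_{1}^{2}+\epsilon _{1}1_{n}$

${\hat {V}}_{2}=(1-2^{-0.8}){\hat {V}}_{1}+2^{-0.8}(G_{2}^{2}+\epsilon _{1}1_{n})\approx 0.426\,{\hat {V}}_{1}+0.574\,(G_{2}^{2}+\epsilon _{1}1_{n})$

Because ${\hat {\beta }}_{21}=0$ under this schedule, the value of ${\hat {V}}_{0}$ never enters the result, so the second moment needs no special initialization.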

Step size:

$\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(x_{t-1}))\rho _{t}$

Where:
- $\rho _{t}$ is the relative step size.
- $\epsilon _{2}$ is a regularization constant.
- ${\text{RMS}}$ is the root mean square. For the normalized update $U_{t}$ with entries $u_{xt}$ , it is defined as:
  - $u_{xt}={\frac {-g_{xt}}{\sqrt {{\hat {v}}_{xt}}}}$
  - ${\text{RMS}}(U_{t})={\text{RMS}}_{x\in X}(u_{xt})={\sqrt {{\text{Mean}}_{x\in X}\left({\frac {(g_{xt})^{2}}{{\hat {v}}_{xt}}}\right)}}$

3. Algorithms

Adafactor for Weight Vectors

Inputs:

Initial point: $X_{0}\in \mathbb {R} ^{n}$
Relative step sizes: $\rho _{t}$ for $t=1$ to $T$
Second moment decay: ${\hat {\beta }}_{2t}$ for $t=1$ to $T$ , with ${\hat {\beta }}_{21}=0$
Regularization constants: $\epsilon _{1},\epsilon _{2}$
Clipping threshold: $d$

Algorithm:

For $t=1$ to $T$ :
- Compute adaptive step size: $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$
- Compute gradient: $G_{t}=\nabla f_{t}(X_{t-1})$
- Update second moment estimate: ${\hat {V}}_{t}={\hat {\beta }}_{2t}{\hat {V}}_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n})$
- Compute normalized gradient: $U_{t}={\frac {G_{t}}{\sqrt {{\hat {V}}_{t}}}}$
- Apply clipping: ${\hat {U}}_{t}={\frac {U_{t}}{\max(1,{\text{RMS}}(U_{t})/d)}}$
- Update parameter: $X_{t}=X_{t-1}-\alpha _{t}{\hat {U}}_{t}$
End for
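
The loop above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names `rms` and `adafactor_vector`, the callback `grad_fn`, and the toy quadratic objective are all assumptions made for the example, and the schedules for $\rho _{t}$ and ${\hat {\beta }}_{2t}$ are the ones proposed later in this article.

```python
import math

def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def adafactor_vector(grad_fn, x0, num_steps, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run the unfactored (vector) update for num_steps steps.

    grad_fn(x) is a hypothetical callback that returns the gradient as a list.
    """
    x = list(x0)
    v = [0.0] * len(x)  # initial value is irrelevant because beta2 = 0 at t = 1
    for t in range(1, num_steps + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))      # relative step size
        beta2 = 1.0 - t ** (-0.8)                # second moment decay
        alpha = max(eps2, rms(x)) * rho          # adaptive step size
        g = grad_fn(x)
        # Exponential moving average of the squared gradient.
        v = [beta2 * vi + (1.0 - beta2) * (gi * gi + eps1) for vi, gi in zip(v, g)]
        # Normalized gradient, then clipping so that RMS(u) is at most d.
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
        scale = max(1.0, rms(u) / d)
        x = [xi - alpha * ui / scale for xi, ui in zip(x, u)]
    return x

# Toy objective (an assumption for illustration): f(x) = 0.5 * sum(x_i^2),
# whose gradient is x itself.
x_final = adafactor_vector(lambda x: list(x), [1.0, -2.0, 3.0], num_steps=200)
print(x_final)
```

On this toy problem the iterates should move toward the origin, since each step is scaled by the current RMS of the parameters.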

Adafactor for Weight Matrices

Inputs:

Initial point: $X_{0}\in \mathbb {R} ^{n\times m}$
Relative step sizes: $\rho _{t}$ for $t=1$ to $T$
Second moment decay: ${\hat {\beta }}_{2t}$ for $t=1$ to $T$ , with ${\hat {\beta }}_{21}=0$
Regularization constants: $\epsilon _{1},\epsilon _{2}$
Clipping threshold: $d$

Algorithm:

For $t=1$ to $T$ :
- Compute adaptive step size: $\alpha _{t}=\max(\epsilon _{2},{\text{RMS}}(X_{t-1}))\rho _{t}$
- Compute gradient: $G_{t}=\nabla f_{t}(X_{t-1})$
- Update row-wise second moment: $R_{t}={\hat {\beta }}_{2t}R_{t-1}+(1-{\hat {\beta }}_{2t})(G_{t}^{2}+\epsilon _{1}1_{n}1_{m}^{T})1_{m}$
- Update column-wise second moment: $C_{t}={\hat {\beta }}_{2t}C_{t-1}+(1-{\hat {\beta }}_{2t})1_{n}^{T}(G_{t}^{2}+\epsilon _{1}1_{n}1_{m}^{T})$
- Update overall second moment estimate: ${\hat {V}}_{t}={\frac {R_{t}C_{t}}{1_{n}^{T}R_{t}}}$
- Compute normalized gradient: $U_{t}={\frac {G_{t}}{\sqrt {{\hat {V}}_{t}}}}$
- Apply clipping: ${\hat {U}}_{t}={\frac {U_{t}}{\max(1,{\text{RMS}}(U_{t})/d)}}$
- Update parameter: $X_{t}=X_{t-1}-\alpha _{t}{\hat {U}}_{t}$
End for
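
A single iteration of the factored update can be sketched as below. The sketch follows the row-sum and column-sum form of the algorithm above, stores only a length-$n$ vector and a length-$m$ vector instead of a full $n\times m$ second moment, and uses the schedules proposed later in this article. The function name `adafactor_matrix_step` and the sample inputs are hypothetical.

```python
import math

def rms(matrix):
    """Root mean square over all entries of a list-of-lists matrix."""
    flat = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))

def adafactor_matrix_step(x, g, r_prev, c_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored update at step t; returns (x_new, r, c)."""
    n, m = len(x), len(x[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))
    beta2 = 1.0 - t ** (-0.8)
    alpha = max(eps2, rms(x)) * rho
    sq = [[g[i][j] ** 2 + eps1 for j in range(m)] for i in range(n)]
    # Row sums (length n) and column sums (length m) of the squared gradient.
    r = [beta2 * r_prev[i] + (1.0 - beta2) * sum(sq[i]) for i in range(n)]
    c = [beta2 * c_prev[j] + (1.0 - beta2) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    total = sum(r)  # corresponds to 1_n^T R_t
    # Rank-1 estimate V[i][j] = r[i] * c[j] / total, applied element-wise.
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms(u) / d)
    x_new = [[x[i][j] - alpha * u[i][j] / scale for j in range(m)]
             for i in range(n)]
    return x_new, r, c

x = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]]
g = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]  # made-up gradient for illustration
x1, r1, c1 = adafactor_matrix_step(x, g, [0.0, 0.0], [0.0, 0.0, 0.0], t=1)
print(len(r1), len(c1))  # 2 3
```

The returned `r` and `c` are passed back in as `r_prev` and `c_prev` on the next step, which is where the memory saving over a full second-moment matrix comes from.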

4. Proposed Hyperparameters for Adafactor

Regularization constant 1: $\epsilon _{1}=10^{-30}$
Regularization constant 2: $\epsilon _{2}=10^{-3}$
Clipping threshold: $d=1$
Relative step size: $\rho _{t}=\min(10^{-2},1/{\sqrt {t}})$
Second moment decay: ${\hat {\beta }}_{2t}=1-t^{-0.8}$
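
The two schedules can be written as small helper functions, which makes it easy to see how they behave over time. The names `relative_step_size` and `second_moment_decay` are hypothetical and chosen only for this sketch.

```python
import math

EPS1 = 1e-30   # regularization constant added to squared gradients
EPS2 = 1e-3    # lower bound on the parameter scale in the step size
CLIP_D = 1.0   # clipping threshold d

def relative_step_size(t: int) -> float:
    """rho_t = min(1e-2, 1/sqrt(t)) for step t >= 1."""
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t: int) -> float:
    """beta2_t = 1 - t^(-0.8) for step t >= 1; equals 0 at t = 1."""
    return 1.0 - t ** (-0.8)

for t in (1, 2, 10, 10_000, 1_000_000):
    print(t, relative_step_size(t), round(second_moment_decay(t), 4))
```

The relative step size stays at $10^{-2}$ until $t=10^{4}$ and decays as $1/{\sqrt {t}}$ afterwards, while the decay parameter starts at 0 and approaches 1 as $t$ grows.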

Numerical Examples

Step-by-step instructions for determining the result of the first iteration.

Problem setup

Initial weights ( $X_{0}$ ):

$X_{0}={\begin{bmatrix}0.7&-0.5&0.9\\-1.1&0.8&-0.6\\1.2&-0.7&0.4\end{bmatrix}}$

Initial gradient ( $G_{t}$ ):

Gradient of the loss function with respect to X

$G_{t}={\begin{bmatrix}0.3&-0.2&0.4\\-0.5&0.6&-0.1\\0.2&-0.4&0.3\end{bmatrix}}$

Hyperparameters setup

$\epsilon _{1}=10^{-30}$ (Regularization constant added to the squared gradient)

$\epsilon _{2}=10^{-3}$ (Minimum learning rate scaling factor)

$d=1$ (Clipping threshold)

$\rho _{t}=\min(10^{-2},1/{\sqrt {t}})$ (Relative step size)

${\hat {\beta }}_{2t}=1-t^{-0.8}$ (Second moment decay)

Step 1: Learning Rate Scaling

Define the relative step size

$\rho _{1}=\min(10^{-2},1/{\sqrt {1}})=10^{-2}$

Step 1.1: Root Mean Square (RMS) calculation for $X_{0}$

RMS formula

$RMS(X_{0})={\sqrt {{\tfrac {1}{n}}\textstyle \sum _{i=1}^{n}\displaystyle X_{0}[i]^{2}}}$

Substitute the initial weights

$RMS(X_{0})={\sqrt {{\tfrac {1}{9}}(0.7^{2}+(-0.5)^{2}+0.9^{2}+(-1.1)^{2}+0.8^{2}+(-0.6)^{2}+1.2^{2}+(-0.7)^{2}+0.4^{2})}}$

$RMS(X_{0})={\sqrt {\frac {5.85}{9}}}\approx 0.806$

Step 1.2: Find the Learning Rate Scaling ( $\alpha _{t}$ ):

Learning rate formula

$\alpha _{1}=\max(\epsilon _{2},RMS(X_{0}))\cdot \rho _{1}$

Substitute the RMS

$\alpha _{1}=\max(0.001,0.806)\cdot 0.01=0.00806$
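
The arithmetic in Step 1 can be checked with a few lines of Python. The nine weight values below are the ones that reproduce the stated RMS of about 0.806; treat them as an assumption if they differ from the matrix you are working with.

```python
import math

# Entries of X_0 that reproduce the stated RMS of about 0.806 (assumed values).
x0 = [
    [0.7, -0.5, 0.9],
    [-1.1, 0.8, -0.6],
    [1.2, -0.7, 0.4],
]
eps2 = 1e-3
rho1 = min(1e-2, 1.0 / math.sqrt(1))  # relative step size at t = 1

flat = [v for row in x0 for v in row]
rms_x0 = math.sqrt(sum(v * v for v in flat) / len(flat))
alpha1 = max(eps2, rms_x0) * rho1

print(f"RMS(X0) = {rms_x0:.3f}")   # RMS(X0) = 0.806
print(f"alpha_1 = {alpha1:.5f}")   # alpha_1 = 0.00806
```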

Step 2: Compute $G_{t}^{2}$ (Element-wise Square of Gradient)

Square the gradient value

$G_{t}^{2}={\begin{bmatrix}0.3^{2}&(-0.2)^{2}&0.4^{2}\\(-0.5)^{2}&0.6^{2}&(-0.1)^{2}\\0.2^{2}&(-0.4)^{2}&0.3^{2}\end{bmatrix}}$

$G_{t}^{2}={\begin{bmatrix}0.09&0.04&0.16\\0.25&0.36&0.01\\0.04&0.16&0.09\end{bmatrix}}$

Step 3: Find the moment estimate

Step 3.1: Compute row moments ( $R_{t}$ )

The equation below computes the row-wise second moments ( $R_{t}$ ) as an exponential moving average of the past moments ( $R_{t-1}$ ) and the current row-wise mean of the squared gradients ( $G_{t}^{2}$ ), with the balance controlled by ${\hat {\beta }}_{2t}$ .

For $G_{t}^{2}\in \mathbb {R} ^{n\times m}$ , the entry of $R_{t}$ for row $i$ is

$R_{t}={\hat {\beta }}_{2t}\cdot R_{t-1}+(1-{\hat {\beta }}_{2t})\cdot ({\tfrac {1}{m}}\textstyle \sum _{j=1}^{m}\displaystyle G_{t}^{2}[i,j]+\epsilon _{1})$

Since ${\hat {\beta }}_{2t}=1-t^{-0.8}$ , the first iteration gives ${\hat {\beta }}_{21}=1-1^{-0.8}=0$ . Because $\epsilon _{1}=10^{-30}$ is negligible next to the squared gradients, we drop it. The update of $R_{1}$ reduces to:

$R_{1}={\tfrac {1}{m}}\textstyle \sum _{j=1}^{m}\displaystyle G_{t}^{2}[i,j]$

Row-wise mean ( $R_{t}$ ):

$R_{1}={\begin{bmatrix}{\tfrac {0.09+0.04+0.16}{3}}\\{\tfrac {0.25+0.36+0.01}{3}}\\{\tfrac {0.04+0.16+0.09}{3}}\end{bmatrix}}={\begin{bmatrix}0.0967\\0.2067\\0.0967\end{bmatrix}}$

Step 3.2: Compute column moments ( $C_{t}$ )

The process is the same as for the row moments, except that the mean is taken over the rows $i$ of each column $j$ :

$C_{t}={\hat {\beta }}_{2t}\cdot C_{t-1}+(1-{\hat {\beta }}_{2t})\cdot ({\tfrac {1}{n}}\textstyle \sum _{i=1}^{n}\displaystyle G_{t}^{2}[i,j]+\epsilon _{1})$

Column Moments ( $C_{t}$ ):

$C_{1}={\begin{bmatrix}{\tfrac {0.09+0.25+0.04}{3}}\\{\tfrac {0.04+0.36+0.16}{3}}\\{\tfrac {0.16+0.01+0.09}{3}}\end{bmatrix}}={\begin{bmatrix}0.1267\\0.1867\\0.0867\end{bmatrix}}$
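
Steps 2 through 3.2 can be reproduced with the short sketch below, which squares the gradient element-wise and then takes row and column means. It assumes the first iteration, where ${\hat {\beta }}_{21}=0$ and $\epsilon _{1}$ is dropped as negligible; the variable names are illustrative.

```python
g = [
    [0.3, -0.2, 0.4],
    [-0.5, 0.6, -0.1],
    [0.2, -0.4, 0.3],
]
n, m = len(g), len(g[0])

# Step 2: element-wise square of the gradient.
g_sq = [[v * v for v in row] for row in g]

# Step 3.1: mean of each row (one value per row).
r1 = [sum(row) / m for row in g_sq]

# Step 3.2: mean of each column (one value per column).
c1 = [sum(g_sq[i][j] for i in range(n)) / n for j in range(m)]

print([round(v, 4) for v in r1])  # [0.0967, 0.2067, 0.0967]
print([round(v, 4) for v in c1])  # [0.1267, 0.1867, 0.0867]
```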

Step 3.3: Second Moment Estimate ( $V_{t}$ )

The Second Moment Estimate is calculated as the outer product of the row moments ( $R_{t}$ ) and column moments ( $C_{t}$ ).

$V_{t}=R_{t}\otimes C_{t}$

$V_{t}={\begin{bmatrix}0.0967\\0.2067\\0.0967\end{bmatrix}}\otimes {\begin{bmatrix}0.1267&0.1867&0.0867\\\end{bmatrix}}$

$V_{t}={\begin{bmatrix}0.0122&0.0180&0.0084\\0.0262&0.0386&0.0179\\0.0122&0.0180&0.0084\end{bmatrix}}$
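
The outer product can be verified numerically as sketched below. The worked example uses the plain outer product of the row and column means. For comparison, the sketch also computes a normalized variant that divides by the mean of the row means, which should correspond to the ${\frac {R_{t}C_{t}}{1_{n}^{T}R_{t}}}$ form of the matrix algorithm when it is expressed with row and column sums. The name `v1_normalized` is hypothetical.

```python
r1 = [0.29 / 3, 0.62 / 3, 0.29 / 3]   # row means from Step 3.1
c1 = [0.38 / 3, 0.56 / 3, 0.26 / 3]   # column means from Step 3.2

# Plain outer product, as used in this worked example.
v1 = [[ri * cj for cj in c1] for ri in r1]
for row in v1:
    print([round(v, 4) for v in row])
# [0.0122, 0.018, 0.0084]
# [0.0262, 0.0386, 0.0179]
# [0.0122, 0.018, 0.0084]

# Normalized variant: with sums R_s, C_s the algorithm uses R_s[i] * C_s[j] / sum(R_s),
# which equals (row mean i) * (column mean j) / (mean of the row means).
mean_r = sum(r1) / len(r1)
v1_normalized = [[v / mean_r for v in row] for row in v1]
```

The two versions differ only by the constant factor `mean_r`, so they produce the same direction for the normalized gradient but a different magnitude before clipping.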

Step 4: Update the vector ( $U_{t}$ )

Step 4.1: Find the vector value of $U_{t}$

Formula of $U_{t}$

$U_{t}={\frac {G_{t}}{\sqrt {V_{t}+\epsilon _{1}}}}$

Substitute $G_{t}$ and $V_{t}$

$U_{1}={\frac {\begin{bmatrix}0.3&-0.2&0.4\\-0.5&0.6&-0.1\\0.2&-0.4&0.3\end{bmatrix}}{\sqrt {\begin{bmatrix}0.0122&0.0180&0.0084\\0.0262&0.0386&0.0179\\0.0122&0.0180&0.0084\end{bmatrix}}}}$