Adafactor
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function $f(X)$, where $X \in \mathbb{R}^n$ is the weight vector to be optimized.
2. Parameters
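- Gradient: $G_t = \nabla f(X_{t-1})$
- Second moment estimate: $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
- Where:
- $\hat{V}_t$ is the running average of the squared gradient.
- $\hat{\beta}_{2t}$ is the corrected decay parameter.
- $\epsilon_1$ is a regularization constant.
- Step size: $\alpha_t = \max(\epsilon_2, RMS(X_{t-1})) \rho_t$
- Where:
- $\rho_t$ is the relative step size.
- $\epsilon_2$ is a regularization constant.
- $RMS$ is the root mean square, defined as: $RMS(X) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} X[i]^2}$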
3. Algorithms
Adafactor for Weighted Vectors
Inputs:
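- Initial point: $X_0 \in \mathbb{R}^n$
- Relative step sizes: $\rho_t$ for $t = 1$ to $T$
- Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{21} = 0$
- Regularization constants: $\epsilon_1, \epsilon_2$
- Clipping threshold: $d$
Algorithm:
- For $t = 1$ to $T$:
- Compute adaptive step size: $\alpha_t = \max(\epsilon_2, RMS(X_{t-1})) \rho_t$
- Compute gradient: $G_t = \nabla f(X_{t-1})$
- Update second moment estimate: $\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)$
- Compute normalized gradient: $U_t = \frac{G_t}{\sqrt{\hat{V}_t}}$
- Apply clipping: $\hat{U}_t = \frac{U_t}{\max(1, RMS(U_t)/d)}$
- Update parameter: $X_t = X_{t-1} - \alpha_t \hat{U}_t$
- End for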
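The loop body for weight vectors can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `adafactor_vector_step`, the toy objective, and the default schedules $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$ are choices made for this example.

```python
import math


def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adafactor_vector_step(x, grad, v_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector (no factoring needed).

    The schedules for rho_t and beta2_hat_t are assumed defaults.
    """
    rho = min(1e-2, 1.0 / math.sqrt(t))      # relative step size
    beta2 = 1.0 - t ** (-0.8)                # decay, equals 0 at t = 1
    alpha = max(eps2, rms(x)) * rho          # adaptive step size
    # Running average of squared gradients.
    v = [beta2 * vp + (1.0 - beta2) * (g * g + eps1)
         for vp, g in zip(v_prev, grad)]
    # Normalized gradient, then clipping so that RMS(u) <= d.
    u = [g / math.sqrt(vi) for g, vi in zip(grad, v)]
    scale = max(1.0, rms(u) / d)
    x_new = [xi - alpha * ui / scale for xi, ui in zip(x, u)]
    return x_new, v


# Hypothetical toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = [1.0, -2.0, 3.0]
v = [0.0, 0.0, 0.0]
for t in range(1, 4):
    x, v = adafactor_vector_step(x, list(x), v, t)
print(x)
```

Because the step size is proportional to $RMS(X_{t-1})$, each update moves the weights by a small fraction of their current scale rather than by an absolute amount.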
Adafactor for Weighted Matrices
Inputs:
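- Initial point: $X_0 \in \mathbb{R}^{n \times m}$
- Relative step sizes: $\rho_t$ for $t = 1$ to $T$
- Second moment decay: $\hat{\beta}_{2t}$ for $t = 1$ to $T$, with $\hat{\beta}_{21} = 0$
- Regularization constants: $\epsilon_1, \epsilon_2$
- Clipping threshold: $d$
Algorithm:
- For $t = 1$ to $T$:
- Compute adaptive step size: $\alpha_t = \max(\epsilon_2, RMS(X_{t-1})) \rho_t$
- Compute gradient: $G_t = \nabla f(X_{t-1})$
- Update row-wise second moment: $R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m$
- Update column-wise second moment: $C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)$
- Update overall second moment estimate: $\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}$
- Compute normalized gradient: $U_t = \frac{G_t}{\sqrt{\hat{V}_t}}$
- Apply clipping: $\hat{U}_t = \frac{U_t}{\max(1, RMS(U_t)/d)}$
- Update parameter: $X_t = X_{t-1} - \alpha_t \hat{U}_t$
- End for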
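The factored update for weight matrices can be sketched in the same style. The code below assumes the factored estimate from the original Adafactor paper, in which row sums and column sums of the squared gradient are tracked and their outer product is divided by the sum of the row accumulator. The function name `adafactor_matrix_step`, the default schedules, and the 2 x 3 example values are hypothetical.

```python
import math


def rms(matrix):
    """Root mean square over all entries of a matrix (list of rows)."""
    flat = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))


def adafactor_matrix_step(x, grad, r_prev, c_prev, t,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update for an n x m weight matrix."""
    n, m = len(x), len(x[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))      # assumed default schedule
    beta2 = 1.0 - t ** (-0.8)                # assumed default schedule
    alpha = max(eps2, rms(x)) * rho
    sq = [[g * g + eps1 for g in row] for row in grad]
    # Row and column sums of the squared gradient: n + m numbers in total.
    r = [beta2 * r_prev[i] + (1.0 - beta2) * sum(sq[i]) for i in range(n)]
    c = [beta2 * c_prev[j] + (1.0 - beta2) * sum(sq[i][j] for i in range(n))
         for j in range(m)]
    total = sum(r)
    # V_hat[i][j] = r[i] * c[j] / total is formed on the fly, never stored.
    u = [[grad[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms(u) / d)
    x_new = [[x[i][j] - alpha * u[i][j] / scale for j in range(m)]
             for i in range(n)]
    return x_new, r, c


# Hypothetical 2 x 3 weights and gradient, first step (t = 1).
X = [[0.5, -0.3, 0.8], [-0.2, 0.4, -0.6]]
G = [[0.1, -0.2, 0.3], [0.4, -0.1, 0.2]]
X, R, C = adafactor_matrix_step(X, G, [0.0, 0.0], [0.0, 0.0, 0.0], t=1)
print(R, C)
```

The optimizer state here is the pair `(R, C)`, which holds $n + m$ values instead of the $n \cdot m$ values a full second moment matrix would need.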
4. Proposed Hyperparameters for Adafactor
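- Regularization constant 1: $\epsilon_1 = 10^{-30}$
- Regularization constant 2: $\epsilon_2 = 10^{-3}$
- Clipping threshold: $d = 1$
- Relative step size: $\rho_t = \min(10^{-2}, 1/\sqrt{t})$
- Second moment decay: $\hat{\beta}_{2t} = 1 - t^{-0.8}$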
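To see how the two schedules behave over training, the short sketch below evaluates them at a few step counts. It assumes the defaults proposed in the original Adafactor paper, $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and $\hat{\beta}_{2t} = 1 - t^{-0.8}$; the function names are hypothetical.

```python
import math


def relative_step_size(t: int) -> float:
    """rho_t = min(1e-2, 1/sqrt(t)) for step t >= 1 (assumed default)."""
    return min(1e-2, 1.0 / math.sqrt(t))


def second_moment_decay(t: int) -> float:
    """beta2_hat_t = 1 - t**(-0.8); equals 0 at t = 1 (assumed default)."""
    return 1.0 - t ** (-0.8)


for t in (1, 10, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t), round(second_moment_decay(t), 4))
```

The relative step size stays at $10^{-2}$ until $t = 10^4$ and decays as $1/\sqrt{t}$ afterwards, while the decay parameter starts at 0, so the first estimate uses only the current gradient, and approaches 1 as $t$ grows.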
Numerical Examples
Step-by-step instructions for determining the result of the first iteration.
Problem setup
Initial weights ($X_0$):
$X_0 = \begin{bmatrix}0.7&-0.5&0.9\\-1.1&0.8&-0.6\\1.2&-0.7&0.4\end{bmatrix}$
Gradient for first iteration ($G_1$), the gradient of the loss function with respect to $X$:
$G_1 = \begin{bmatrix}0.3&-0.2&0.4\\-0.5&0.6&-0.1\\0.2&-0.4&0.3\end{bmatrix}$
Hyperparameters setup
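$\epsilon_2 = 10^{-3}$ (Minimum learning rate scaling factor)
$\epsilon_1 = 10^{-30}$ (Regularization constant)
$d = 1$ (Clipping threshold)
$\rho_t = \min(10^{-2}, 1/\sqrt{t})$ (Relative step size)
$\hat{\beta}_{2t} = 1 - t^{-0.8}$ (Second moment decay)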
Step 1: Learning Rate Scaling
Define the relative step size
$\rho_1 = \min(10^{-2}, 1/\sqrt{1}) = 10^{-2}$
Step 1.1: Root Mean Square (RMS) calculation for $X_0$
RMS formula
$RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} X_0[i]^2}$
Substitute the initial weights
$RMS(X_0) = \sqrt{\tfrac{1}{9}\left(0.7^2 + (-0.5)^2 + 0.9^2 + (-1.1)^2 + 0.8^2 + (-0.6)^2 + 1.2^2 + (-0.7)^2 + 0.4^2\right)} = \sqrt{\tfrac{5.85}{9}} \approx 0.806$
Step 1.2: Find the Learning Rate Scaling ($\alpha_1$):
Learning rate formula
$\alpha_1 = \max(\epsilon_2, RMS(X_0)) \cdot \rho_1$
Substitute the RMS
$\alpha_1 = \max(10^{-3}, 0.806) \cdot 10^{-2} = 0.00806$
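Step 1 can be checked numerically with a few lines of Python. The sketch assumes the initial weights with $X_0[2,3] = -0.6$, which is the value consistent with the step size 0.00806 used in the weight update of this example, and $\epsilon_2 = 10^{-3}$; the variable names are illustrative.

```python
import math

# Initial weights (assumed; entry [2,3] taken as -0.6).
X0 = [[0.7, -0.5, 0.9],
      [-1.1, 0.8, -0.6],
      [1.2, -0.7, 0.4]]
eps2 = 1e-3                             # assumed regularization constant
rho1 = min(1e-2, 1.0 / math.sqrt(1))    # relative step size at t = 1

flat = [v for row in X0 for v in row]
rms_x0 = math.sqrt(sum(v * v for v in flat) / len(flat))
alpha1 = max(eps2, rms_x0) * rho1

print(round(rms_x0, 4), round(alpha1, 5))  # 0.8062 0.00806
```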
Step 2: Compute $G_t^2$ (Element-wise Square of Gradient)
Compute the squared value of each element in the gradient matrix $G_t$.
$G_1^2 = \begin{bmatrix}0.3^2&(-0.2)^2&0.4^2\\(-0.5)^2&0.6^2&(-0.1)^2\\0.2^2&(-0.4)^2&0.3^2\end{bmatrix} = \begin{bmatrix}0.09&0.04&0.16\\0.25&0.36&0.01\\0.04&0.16&0.09\end{bmatrix}$
Step 3: Find the moment estimate
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.
Step 3.1: Compute row moments ($R_t$)
The equation below computes the row-wise second moments ($R_t$) as an exponential moving average of the past moments ($R_{t-1}$) and the current row-wise mean of the squared gradients ($G_t^2$), with the balance controlled by $\hat{\beta}_{2t}$.
For $G_t^2 \in \mathbb{R}^{n \times m}$ ($n$ rows, $m$ columns):
$R_t[i] = \hat{\beta}_{2t} \cdot R_{t-1}[i] + (1 - \hat{\beta}_{2t}) \cdot \left(\tfrac{1}{m}\sum_{j=1}^{m} G_t^2[i,j] + \epsilon_1\right)$
Since $\hat{\beta}_{2t} = 1 - t^{-0.8}$, for the first iteration $\hat{\beta}_{21} = 0$. Because $\epsilon_1$ is negligibly small, we can ignore it. The update of $R_1$ therefore reduces to the row-wise mean of $G_1^2$.
Row-wise mean ($R_1$):
$R_1 = \begin{bmatrix}\tfrac{0.09+0.04+0.16}{3}\\\tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3}\end{bmatrix} = \begin{bmatrix}0.0967\\0.2067\\0.0967\end{bmatrix}$
Step 3.2: Compute column moments ($C_t$)
The process is the same as for the row moments, except that the mean is taken over the $n$ rows of each column.
$C_t[j] = \hat{\beta}_{2t} \cdot C_{t-1}[j] + (1 - \hat{\beta}_{2t}) \cdot \left(\tfrac{1}{n}\sum_{i=1}^{n} G_t^2[i,j] + \epsilon_1\right)$
Column-wise mean ($C_1$):
$C_1 = \begin{bmatrix}\tfrac{0.09+0.25+0.04}{3}&\tfrac{0.04+0.36+0.16}{3}&\tfrac{0.16+0.01+0.09}{3}\end{bmatrix} = \begin{bmatrix}0.1267&0.1867&0.0867\end{bmatrix}$
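The row and column moments for the first iteration can be reproduced with the sketch below. It assumes $\hat{\beta}_{21} = 0$ and a negligible $\epsilon_1$, so both moments reduce to plain means of the squared gradient; the variable names are illustrative.

```python
# Gradient of the first iteration, as given in the problem setup.
G1 = [[0.3, -0.2, 0.4],
      [-0.5, 0.6, -0.1],
      [0.2, -0.4, 0.3]]

# Element-wise square (Step 2).
G1_sq = [[g * g for g in row] for row in G1]

# Row-wise and column-wise means of the squared gradient.
R1 = [sum(row) / len(row) for row in G1_sq]
C1 = [sum(col) / len(col) for col in zip(*G1_sq)]

print([round(r, 4) for r in R1])  # [0.0967, 0.2067, 0.0967]
print([round(c, 4) for c in C1])  # [0.1267, 0.1867, 0.0867]
```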
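Step 3.3: Second Moment Estimate ($\hat{V}_t$)
The second moment estimate is calculated as the outer product of the row moments ($R_t$) and column moments ($C_t$). The published Adafactor algorithm also divides this product by a normalizing constant (the mean of $R_t$ when row and column means are used). This example omits that constant factor, which does not change the first update because the clipping in Step 4.2 rescales $U_1$ to unit RMS either way.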
$\hat{V}_t = R_t \otimes C_t$
$\hat{V}_1 = \begin{bmatrix}0.0967\\0.2067\\0.0967\end{bmatrix} \otimes \begin{bmatrix}0.1267&0.1867&0.0867\end{bmatrix}$
$\hat{V}_1 = \begin{bmatrix}0.0122&0.0180&0.0084\\0.0262&0.0386&0.0179\\0.0122&0.0180&0.0084\end{bmatrix}$
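The outer product can be verified with a short sketch. It uses the exact row and column means (for example $0.29/3$ rather than the rounded 0.0967) and the plain outer product of this example, without any additional normalization; the variable names are illustrative.

```python
# Exact row and column means of the squared gradient from Steps 3.1 and 3.2.
R1 = [0.29 / 3, 0.62 / 3, 0.29 / 3]
C1 = [0.38 / 3, 0.56 / 3, 0.26 / 3]

# Outer product: V1[i][j] = R1[i] * C1[j].
V1 = [[r * c for c in C1] for r in R1]

for row in V1:
    print([round(v, 4) for v in row])
```

Only the three entries of `R1` and the three entries of `C1` need to be stored between iterations; `V1` can be rebuilt from them whenever it is needed.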
Step 4: Update the vector ($U_t$)
$U_t$ is computed by scaling the gradient matrix $G_t$ element-wise with the inverse square root of the second moment estimate ($\hat{V}_t$).
Step 4.1: Find the vector value of $U_t$
Formula of $U_t$
$U_t = \frac{G_t}{\sqrt{\hat{V}_t + \epsilon_1}}$
Substitute $G_1$ and $\hat{V}_1$
$U_1 = \frac{\begin{bmatrix}0.3&-0.2&0.4\\-0.5&0.6&-0.1\\0.2&-0.4&0.3\end{bmatrix}}{\sqrt{\begin{bmatrix}0.0122&0.0180&0.0084\\0.0262&0.0386&0.0179\\0.0122&0.0180&0.0084\end{bmatrix}}}$
$U_1 = \begin{bmatrix}2.711&-1.489&4.370\\-3.090&3.055&-0.747\\1.807&-2.978&3.278\end{bmatrix}$
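A quick numerical check of $U_1$ is sketched below. It recomputes the row and column means from $G_1$ instead of using the rounded entries of $\hat{V}_1$, and assumes $\epsilon_1 = 10^{-30}$; the variable names are illustrative.

```python
import math

G1 = [[0.3, -0.2, 0.4],
      [-0.5, 0.6, -0.1],
      [0.2, -0.4, 0.3]]
eps1 = 1e-30  # assumed regularization constant

# Row and column means of the squared gradient.
R1 = [sum(g * g for g in row) / 3 for row in G1]
C1 = [sum(row[j] ** 2 for row in G1) / 3 for j in range(3)]

# Element-wise division of the gradient by sqrt(V1 + eps1), V1 = R1 (x) C1.
U1 = [[G1[i][j] / math.sqrt(R1[i] * C1[j] + eps1) for j in range(3)]
      for i in range(3)]

for row in U1:
    print([round(u, 3) for u in row])
```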
Step 4.2: Clipped Update Vector $\hat{U}_t$
Scale the update vector ($U_t$) to ensure its RMS value does not exceed a predefined clipping threshold ($d$), maintaining stability in updates.
Formula of $\hat{U}_t$
$\hat{U}_t = \frac{U_t}{\max(1, \tfrac{RMS(U_t)}{d})}$
Compute RMS of $U_1$
$RMS(U_1) = \sqrt{\tfrac{1}{9}\sum_{i=1}^{9} U_1[i]^2} \approx 2.808$
Since $RMS(U_1) > d$, scale $U_1$ by $\tfrac{1}{2.808}$
$\hat{U}_1 = \begin{bmatrix}0.965&-0.530&1.556\\-1.100&1.088&-0.266\\0.644&-1.060&1.167\end{bmatrix}$
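The clipping step can be checked with the sketch below, which recomputes the RMS directly from the rounded entries of $U_1$ given in Step 4.1 and assumes $d = 1$. With these inputs the RMS evaluates to roughly 2.808; the variable names are illustrative.

```python
import math

# Rounded update matrix from Step 4.1.
U1 = [[2.711, -1.489, 4.370],
      [-3.090, 3.055, -0.747],
      [1.807, -2.978, 3.278]]
d = 1.0  # assumed clipping threshold

flat = [u for row in U1 for u in row]
rms_u = math.sqrt(sum(u * u for u in flat) / len(flat))

# Divide by max(1, RMS/d): the matrix is only ever scaled down.
scale = max(1.0, rms_u / d)
U1_hat = [[u / scale for u in row] for row in U1]

print(round(rms_u, 3))
for row in U1_hat:
    print([round(u, 3) for u in row])
```

After clipping, the RMS of `U1_hat` is exactly $d$, so the size of the step is set by $\alpha_1$ alone.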
Step 5: Weight Update ($X_1$)
Adjust the weights ($X_t$) by subtracting the product of the learning rate ($\alpha_t$) and the clipped update vector ($\hat{U}_t$).
$X_1 = X_0 - \alpha_1 \cdot \hat{U}_1$
The result for the first iteration:
$X_1 = \begin{bmatrix}0.7&-0.5&0.9\\-1.1&0.8&-0.6\\1.2&-0.7&0.4\end{bmatrix} - 0.00806 \cdot \begin{bmatrix}0.965&-0.530&1.556\\-1.100&1.088&-0.266\\0.644&-1.060&1.167\end{bmatrix}$
$X_1 = \begin{bmatrix}0.692&-0.496&0.887\\-1.091&0.791&-0.598\\1.195&-0.691&0.391\end{bmatrix}$
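The final subtraction can be reproduced with the sketch below. It takes $\alpha_1 = 0.00806$ and rounded entries of $\hat{U}_1$ as inputs, and assumes $X_0[2,3] = -0.6$ and $\hat{U}_1[3,1] = 0.644$, the values obtained when the earlier steps are recomputed from $G_1$; the variable names are illustrative.

```python
# Inputs to the weight update (rounded values; see the assumptions above).
X0 = [[0.7, -0.5, 0.9],
      [-1.1, 0.8, -0.6],
      [1.2, -0.7, 0.4]]
alpha1 = 0.00806
U1_hat = [[0.965, -0.530, 1.556],
          [-1.100, 1.088, -0.266],
          [0.644, -1.060, 1.167]]

# X1 = X0 - alpha1 * U1_hat, element by element.
X1 = [[x - alpha1 * u for x, u in zip(x_row, u_row)]
      for x_row, u_row in zip(X0, U1_hat)]

for row in X1:
    print([round(x, 3) for x in row])
```

Each weight moves by less than 0.013 in absolute value, about 1% of $RMS(X_0)$, which is what the relative step size $\rho_1 = 10^{-2}$ is designed to produce.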