Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Problem formulation
1. Objective
Minimize the loss function <math>f(X)</math>, where <math>f:\mathbb{R}^{n}\to\mathbb{R}</math> and <math>X \in \mathbb{R}^{n}</math> is the weight vector to be optimized.
2. Parameters
<math>\hat{V}_t = \hat{\beta}_{2t}\,\hat{V}_{t-1} + (1-\hat{\beta}_{2t})(G_t^{2} + \epsilon_1)</math>
- Where:
<math>\hat{V}_t</math> is the running average of the squared gradient.
<math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
<math>\epsilon_1</math> is a regularization constant.
<math>\alpha_t = \max(\epsilon_2,\ RMS(X_{t-1}))\,\rho_t</math>
- Where:
<math>\rho_t</math> is the relative step size.
<math>\epsilon_2</math> is a regularization constant.
<math>RMS</math> is the root mean square, defined as:
<math>RMS(X) = \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X[i]^2}</math>
3. Algorithms
Adafactor for Weighted Vectors
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2,\ RMS(X_{t-1}))\,\rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t}\,\hat{V}_{t-1} + (1-\hat{\beta}_{2t})(G_t^{2} + \epsilon_1 1_n)</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1,\ RMS(U_t)/d)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
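To make the vector update loop concrete, the following is a minimal NumPy sketch of the steps listed above. The function names (<code>adafactor_vector</code>, <code>grad_f</code>) and the toy objective in the usage line are illustrative assumptions, not part of the algorithm statement.
<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    # Root mean square of all entries of x.
    return np.sqrt(np.mean(np.square(x)))

def adafactor_vector(x0, grad_f, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run T Adafactor iterations on a weight vector, following the steps above."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                     # running second-moment estimate
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))    # relative step size
        alpha = max(eps2, rms(x)) * rho      # adaptive step size
        g = grad_f(x)                        # gradient of the loss at x
        beta2 = 1.0 - t ** (-0.8)            # corrected decay; equals 0 at t = 1
        v = beta2 * v + (1.0 - beta2) * (g ** 2 + eps1)
        u = g / np.sqrt(v)                   # normalized gradient
        u_hat = u / max(1.0, rms(u) / d)     # clip so that RMS(u_hat) <= d
        x = x - alpha * u_hat                # parameter update
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_final = adafactor_vector(np.array([0.7, -0.5, 0.9]), lambda x: 2 * x, T=100)
</syntaxhighlight>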
Adafactor for Weighted Matrices
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2,\ RMS(X_{t-1}))\,\rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update row-wise second moment: <math>R_t = \hat{\beta}_{2t}\,R_{t-1} + (1-\hat{\beta}_{2t})(G_t^{2} + \epsilon_1 1_n 1_m^{\top})\,1_m</math>
- Update column-wise second moment: <math>C_t = \hat{\beta}_{2t}\,C_{t-1} + (1-\hat{\beta}_{2t})\,1_n^{\top}(G_t^{2} + \epsilon_1 1_n 1_m^{\top})</math>
- Update overall second moment estimate: <math>\hat{V}_t = \frac{R_t C_t}{1_n^{\top} R_t}</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1,\ RMS(U_t)/d)}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
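A corresponding NumPy sketch of the matrix (factored) variant is given below; it stores only a row vector and a column vector of second moments instead of a full matrix. The function name and default arguments are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def adafactor_matrix(x0, grad_f, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run T Adafactor iterations on an n-by-m weight matrix with factored second moments."""
    x = np.asarray(x0, dtype=float).copy()
    n, m = x.shape
    r = np.zeros(n)                              # row-wise second moments R_t
    c = np.zeros(m)                              # column-wise second moments C_t
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))
        alpha = max(eps2, rms(x)) * rho
        g = grad_f(x)
        beta2 = 1.0 - t ** (-0.8)
        g2 = g ** 2 + eps1
        r = beta2 * r + (1.0 - beta2) * g2.sum(axis=1)   # (G_t^2 + eps1) 1_m
        c = beta2 * c + (1.0 - beta2) * g2.sum(axis=0)   # 1_n^T (G_t^2 + eps1)
        v = np.outer(r, c) / r.sum()             # rank-1 estimate R_t C_t / (1_n^T R_t)
        u = g / np.sqrt(v)                       # normalized gradient
        u_hat = u / max(1.0, rms(u) / d)         # clip so that RMS(u_hat) <= d
        x = x - alpha * u_hat
    return x
</syntaxhighlight>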
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
- Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
- Clipping threshold: <math>d = 1</math>
- Relative step size: <math>\rho_t = \min(10^{-2},\ \tfrac{1}{\sqrt{t}})</math>
- Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
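The two schedules above depend only on the iteration counter. The small snippet below (helper names are illustrative) evaluates them at <math>t = 1</math>, giving <math>\rho_1 = 10^{-2}</math> and <math>\hat{\beta}_{21} = 0</math>, the values used in the numerical example that follows.
<syntaxhighlight lang="python">
import numpy as np

def rel_step_size(t):
    # rho_t = min(1e-2, 1/sqrt(t))
    return min(1e-2, 1.0 / np.sqrt(t))

def second_moment_decay(t):
    # beta2_t = 1 - t^(-0.8)
    return 1.0 - t ** (-0.8)

print(rel_step_size(1), second_moment_decay(1))   # 0.01 0.0
</syntaxhighlight>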
Numerical Examples
Step-by-step instructions for determining the result of the first iteration.
Problem setup
Initial weights (<math>X_0</math>):
<math>X_0 = \begin{bmatrix} 0.7 & -0.5 & 0.9 \\ -1.1 & 0.8 & -0.6 \\ 1.2 & -0.7 & 0.4 \end{bmatrix}</math>
Gradient for first iteration (<math>G_1</math>):
Gradient of the loss function with respect to X
<math>G_1 = \begin{bmatrix} 0.3 & -0.2 & 0.4 \\ -0.5 & 0.6 & -0.1 \\ 0.2 & -0.4 & 0.3 \end{bmatrix}</math>
Hyperparameters setup
<math>\epsilon_2 = 10^{-3}</math> (Minimum learning rate scaling factor)
<math>\epsilon_1 = 10^{-30}</math> (Regularization constant)
<math>d = 1</math> (Clipping threshold)
<math>\rho_t = \min(10^{-2},\ \tfrac{1}{\sqrt{t}})</math> (Relative step size)
<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (Second moment decay)
Step 1: Learning Rate Scaling
Define the relative step size:
<math>\rho_1 = \min(10^{-2},\ \tfrac{1}{\sqrt{1}}) = 10^{-2}</math>
Step 1.1: Root Mean Square (RMS) calculation for <math>X_0</math>
RMS formula
<math>RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_0[i]^2}</math>
Substitute the initial weights
<math>RMS(X_0) = \sqrt{\tfrac{0.7^2 + (-0.5)^2 + 0.9^2 + (-1.1)^2 + 0.8^2 + (-0.6)^2 + 1.2^2 + (-0.7)^2 + 0.4^2}{9}} = \sqrt{0.65} \approx 0.806</math>
Step 1.2: Find the Learning Rate Scaling (<math>\alpha_1</math>):
Learning rate formula
<math>\alpha_t = \max(\epsilon_2,\ RMS(X_{t-1}))\cdot\rho_t</math>
Substitute the RMS
<math>\alpha_1 = \max(10^{-3},\ 0.806)\cdot 10^{-2} \approx 0.00806</math>
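These two quantities can be checked with a few lines of NumPy (a minimal sketch; the variable names are only for illustration):
<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -0.6],
               [1.2, -0.7, 0.4]])

rms_X0 = np.sqrt(np.mean(X0 ** 2))      # root mean square over all 9 entries
rho_1 = min(1e-2, 1.0 / np.sqrt(1))     # relative step size at t = 1
alpha_1 = max(1e-3, rms_X0) * rho_1     # adaptive step size
print(rms_X0, alpha_1)                  # approximately 0.806 and 0.00806
</syntaxhighlight>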
Step 2: Compute <math>G^{2}_t</math> (Element-wise Square of Gradient)
Compute the squared value of each element in the gradient matrix <math>G_1</math>.
<math>G^{2}_1 = \begin{bmatrix} 0.3^2 & (-0.2)^2 & 0.4^2 \\ (-0.5)^2 & 0.6^2 & (-0.1)^2 \\ 0.2^2 & (-0.4)^2 & 0.3^2 \end{bmatrix} = \begin{bmatrix} 0.09 & 0.04 & 0.16 \\ 0.25 & 0.36 & 0.01 \\ 0.04 & 0.16 & 0.09 \end{bmatrix}</math>
Step 3: Find the moment estimate
Compute the exponential moving average of squared gradients to capture the variance or scale of the gradients.
Step 3.1: Compute row moments (<math>R_t</math>)
<math>R_t = \hat{\beta}_{2t}\cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot\left(\tfrac{1}{m}\textstyle\sum_{j=1}^{m} G^{2}_t[i,j] + \epsilon_1\right)</math>
This equation computes the row-wise second moments (<math>R_t</math>) as an exponential moving average of past moments (<math>R_{t-1}</math>) and the current row-wise mean of squared gradients (<math>\tfrac{1}{m}\sum_{j=1}^{m} G^{2}_t[i,j]</math>), with a balance controlled by (<math>\hat{\beta}_{2t}</math>).
For <math>t = 1</math>:
Since <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>, for the first iteration <math>\hat{\beta}_{21} = 0</math>. And because <math>\epsilon_1</math> is very small, we can ignore it. The update of <math>R_t</math> is:
<math>R_1 = \tfrac{1}{m}\textstyle\sum_{j=1}^{m} G^{2}_1[i,j]</math>
Row-wise mean (<math>R_1</math>):
<math>R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3} \\ \tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967 \\ 0.2067 \\ 0.0967 \end{bmatrix}</math>
Step 3.2: Compute column moments (<math>C_t</math>)
The process is the same as for the row moments:
<math>C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot\left(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} G^{2}_t[i,j] + \epsilon_1\right)</math>
Column-wise mean (<math>C_1</math>):
<math>C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} & \tfrac{0.04+0.36+0.16}{3} & \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267 & 0.1867 & 0.0867 \end{bmatrix}</math>
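Steps 2 and 3 are straightforward to verify numerically (a minimal sketch with illustrative variable names):
<syntaxhighlight lang="python">
import numpy as np

G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])

G1_sq = G1 ** 2                 # Step 2: element-wise square of the gradient
R1 = G1_sq.mean(axis=1)         # Step 3.1: row-wise means (beta2 = 0 at t = 1)
C1 = G1_sq.mean(axis=0)         # Step 3.2: column-wise means
print(R1)                       # [0.0967 0.2067 0.0967] (rounded)
print(C1)                       # [0.1267 0.1867 0.0867] (rounded)
</syntaxhighlight>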
Step 3.3: Second Moment Estimate (<math>\hat{V}_t</math>)
The Second Moment Estimate is calculated as the outer product of the row moments (<math>R_t</math>) and column moments (<math>C_t</math>), normalized by the sum of the row moments:
<math>\hat{V}_1 = \frac{R_1 C_1}{1_n^{\top} R_1} = \frac{1}{0.4}\begin{bmatrix} 0.0967 \\ 0.2067 \\ 0.0967 \end{bmatrix}\begin{bmatrix} 0.1267 & 0.1867 & 0.0867 \end{bmatrix} = \begin{bmatrix} 0.0306 & 0.0451 & 0.0209 \\ 0.0654 & 0.0964 & 0.0448 \\ 0.0306 & 0.0451 & 0.0209 \end{bmatrix}</math>
Step 4: Update the vector (<math>U_t</math>)
Computed by scaling the gradient matrix <math>G_t</math> element-wise with the inverse square root of the second moment estimate (<math>\hat{V}_t</math>).
Step 4.1: Find the vector value of <math>U_t</math>
Formula of <math>U_t</math>
<math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
Substitute <math>G_1</math> and <math>\hat{V}_1</math>:
<math>U_1 = \begin{bmatrix} 1.715 & -0.942 & 2.764 \\ -1.954 & 1.932 & -0.473 \\ 1.143 & -1.883 & 2.073 \end{bmatrix}</math>
Step 4.2: Clipped Update Vector (<math>\hat{U}_t</math>)
Scale the update vector (<math>U_t</math>) to ensure its RMS value does not exceed a predefined clipping threshold (<math>d</math>), maintaining stability in updates.
Formula of <math>\hat{U}_t</math>
<math>\hat{U}_t = \frac{U_t}{\max\left(1,\ \tfrac{RMS(U_t)}{d}\right)}</math>
Compute the RMS of <math>U_1</math>:
<math>RMS(U_1) = \sqrt{\tfrac{1}{9}\textstyle\sum_{i,j} U_1[i,j]^2} \approx 1.776</math>
Since <math>RMS(U_1) > d</math>, scale <math>U_1</math> by <math>\tfrac{1}{1.776}</math>:
<math>\hat{U}_1 = \begin{bmatrix} 0.965 & -0.530 & 1.556 \\ -1.100 & 1.088 & -0.266 \\ 0.644 & -1.060 & 1.167 \end{bmatrix}</math>
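The chain from Step 3.3 through Step 4.2 can be checked in a few lines of NumPy (a minimal sketch; variable names are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])
R1 = (G1 ** 2).mean(axis=1)            # row-wise means (Step 3.1)
C1 = (G1 ** 2).mean(axis=0)            # column-wise means (Step 3.2)

V1 = np.outer(R1, C1) / R1.sum()       # Step 3.3: factored second-moment estimate
U1 = G1 / np.sqrt(V1)                  # Step 4.1: normalized gradient
d = 1.0                                # clipping threshold
U1_hat = U1 / max(1.0, rms(U1) / d)    # Step 4.2: clip so that RMS(U1_hat) <= d
print(np.round(U1_hat, 3))             # matches the clipped update matrix above
</syntaxhighlight>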
Step 5: Weight Update (<math>X_1</math>)
Adjust the weights (<math>X_t</math>) by subtracting the product of the learning rate (<math>\alpha_t</math>) and the clipped update vector (<math>\hat{U}_t</math>).
<math>X_1 = X_0 - \alpha_1 \cdot \hat{U}_1</math>
The result for the first iteration:
<math>X_1 = \begin{bmatrix} 0.7 & -0.5 & 0.9 \\ -1.1 & 0.8 & -0.6 \\ 1.2 & -0.7 & 0.4 \end{bmatrix} - 0.00806 \cdot \begin{bmatrix} 0.965 & -0.530 & 1.556 \\ -1.100 & 1.088 & -0.266 \\ 0.644 & -1.060 & 1.167 \end{bmatrix} = \begin{bmatrix} 0.692 & -0.496 & 0.887 \\ -1.091 & 0.791 & -0.598 \\ 1.195 & -0.691 & 0.391 \end{bmatrix}</math>
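Given <math>\alpha_1</math> and <math>\hat{U}_1</math>, the final update is a single line of NumPy (a minimal check; names are illustrative):
<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -0.6],
               [1.2, -0.7, 0.4]])
U1_hat = np.array([[0.965, -0.530, 1.556],
                   [-1.100, 1.088, -0.266],
                   [0.644, -1.060, 1.167]])
alpha_1 = 0.00806

X1 = X0 - alpha_1 * U1_hat      # Step 5: weight update
print(np.round(X1, 3))          # matches the result matrix shown above
</syntaxhighlight>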
Applications
Conclusion
Reference