Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Adafactor is a stochastic optimization method that modifies Adam by replacing the full second-moment accumulator with factored row and column statistics, reducing memory cost from <math>O(nm)</math> to <math>O(n+m)</math> per weight matrix, and by using a relative step size with update clipping in place of bias correction.
Problem formulation
1. Objective
Minimize the loss function <math>f(X)</math>, where <math>X \in \mathbb{R}^n</math> and <math>X</math> is the weight vector to be optimized.
2. Parameters
<math>\hat{V}_t = \hat{\beta}_{2t}\cdot\hat{V}_{t-1} + (1-\hat{\beta}_{2t})\cdot(G_t^{2} + \epsilon_1)</math>
- Where:
- <math>\hat{V}_t</math> is the running average of the squared gradient.
- <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
- <math>\epsilon_1</math> is a regularization constant.
<math>\alpha_t = \max(\epsilon_2, RMS(X_{t-1}))\cdot\rho_t</math>
- Where:
- <math>\rho_t</math> is the relative step size.
- <math>\epsilon_2</math> is a regularization constant.
- <math>RMS</math> is the root mean square, defined as:
<math>RMS(X) = \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X[i]^{2}}</math>
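To make the RMS definition concrete, here is a minimal NumPy sketch (the helper name <code>rms</code> is ours, not part of the source):
<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    # Root mean square over all entries of a vector or matrix.
    return np.sqrt(np.mean(np.square(x)))

print(rms(np.array([0.3, -0.2, 0.4])))  # ~0.3109
</syntaxhighlight>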
3. Algorithms
Adafactor for Weighted Vectors
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^n</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, RMS(X_{t-1}))\cdot\rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t}\cdot\hat{V}_{t-1} + (1-\hat{\beta}_{2t})\cdot(G_t^{2} + \epsilon_1)</math>
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \tfrac{RMS(U_t)}{d})}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
A runnable sketch of this loop is given below.
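The following is a minimal NumPy sketch of the loop above, under the proposed schedules for <math>\rho_t</math> and <math>\hat{\beta}_{2t}</math> from Section 4; <code>grad_fn</code> is a user-supplied gradient function and the naming is ours, not the source's:
<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def adafactor_vector(x0, grad_fn, T, eps1=1e-30, eps2=1e-3, d=1.0):
    x = x0.astype(float).copy()
    v = np.zeros_like(x)                    # running average of squared gradient
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))   # relative step size
        beta2 = 1.0 - t ** (-0.8)           # second moment decay; beta2 = 0 at t = 1
        alpha = max(eps2, rms(x)) * rho     # adaptive step size
        g = grad_fn(x)                      # gradient of the loss at x
        v = beta2 * v + (1.0 - beta2) * (g * g + eps1)
        u = g / np.sqrt(v)                  # normalized gradient
        u_hat = u / max(1.0, rms(u) / d)    # clipped update
        x = x - alpha * u_hat
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_final = adafactor_vector(np.array([1.0, -2.0, 3.0]), lambda x: 2 * x, T=100)
print(x_final)
</syntaxhighlight>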
Adafactor for Weighted Matrices
Inputs:
- Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
- Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
- Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
- Regularization constants: <math>\epsilon_1, \epsilon_2</math>
- Clipping threshold: <math>d</math>
Algorithm:
- For <math>t = 1</math> to <math>T</math>:
- Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2, RMS(X_{t-1}))\cdot\rho_t</math>
- Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
- Update row-wise second moment: <math>R_t = \hat{\beta}_{2t}\cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot(\tfrac{1}{m}\textstyle\sum_{j=1}^{m} G_t^{2}[i,j] + \epsilon_1)</math>
- Update column-wise second moment: <math>C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} G_t^{2}[i,j] + \epsilon_1)</math>
- Update overall second moment estimate: <math>V_t = R_t C_t^{\top}</math> (the outer product of the row and column moments)
- Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{V_t}}</math>
- Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max(1, \tfrac{RMS(U_t)}{d})}</math>
- Update parameter: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
- End for
A sketch of this factored loop follows the list.
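Below is a NumPy sketch of the factored loop. It follows this page's convention (row and column moments are means of the squared gradients, combined by an outer product); the original Adafactor paper instead accumulates sums and normalizes the outer product by the total of the row moments. <code>grad_fn</code> is again our naming:
<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def adafactor_matrix(x0, grad_fn, T, eps1=1e-30, eps2=1e-3, d=1.0):
    x = x0.astype(float).copy()
    n, m = x.shape
    r = np.zeros(n)                         # row moments (size n, not n*m)
    c = np.zeros(m)                         # column moments (size m)
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))
        beta2 = 1.0 - t ** (-0.8)
        alpha = max(eps2, rms(x)) * rho
        g = grad_fn(x)
        g2 = g * g + eps1
        r = beta2 * r + (1.0 - beta2) * g2.mean(axis=1)  # mean over columns
        c = beta2 * c + (1.0 - beta2) * g2.mean(axis=0)  # mean over rows
        v = np.outer(r, c)                  # factored second moment estimate
        u = g / np.sqrt(v)
        u_hat = u / max(1.0, rms(u) / d)
        x = x - alpha * u_hat
    return x
</syntaxhighlight>
Only the two factor vectors <code>r</code> and <code>c</code> are carried between iterations, which is the memory saving that motivates Adafactor.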
4. Proposed Hyperparameters for Adafactor
- Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
- Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
- Clipping threshold: <math>d = 1</math>
- Relative step size: <math>\rho_t = \min(10^{-2}, \tfrac{1}{\sqrt{t}})</math>
- Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
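As a quick illustration of these schedules (our check, not part of the source): the relative step size stays at its cap of <math>10^{-2}</math> through <math>t = 10^4</math> and decays as <math>1/\sqrt{t}</math> afterward, while <math>\hat{\beta}_{2t}</math> climbs from 0 toward 1:
<syntaxhighlight lang="python">
for t in (1, 2, 10, 100, 10000):
    rho = min(1e-2, 1 / t ** 0.5)
    beta2 = 1 - t ** (-0.8)
    print(t, rho, round(beta2, 4))
# 1     -> rho 0.01, beta2 0.0
# 2     -> rho 0.01, beta2 0.4257
# 10    -> rho 0.01, beta2 0.8415
# 100   -> rho 0.01, beta2 0.9749
# 10000 -> rho 0.01, beta2 0.9994
</syntaxhighlight>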
Numerical Examples
Step-by-step instructions for determining the result of the first iteration.
Problem setup
Initial weights (<math>X_0</math>): a <math>3 \times 3</math> weight matrix; only its root mean square, <math>RMS(X_0) \approx 0.806</math>, is used below.
Initial gradient (<math>G_t</math>):
<math>G_t = \begin{bmatrix} 0.3&-0.2&0.4\\ -0.5&0.6&-0.1\\0.2&-0.4 &0.3 \end{bmatrix}</math>
the gradient of the loss function with respect to <math>X</math>.
Hyperparameters setup
<math>\epsilon_2 = 0.001</math> (minimum learning rate scaling factor)
<math>\epsilon_1 = 10^{-30}</math> (regularization constant)
<math>d = 1</math> (clipping threshold)
<math>\rho_t = \min(10^{-2}, \tfrac{1}{\sqrt{t}})</math> (relative step size)
<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (second moment decay)
Step 1: Learning Rate Scaling
Define the relative step size: <math>\rho_1 = \min(10^{-2}, \tfrac{1}{\sqrt{1}}) = 0.01</math>
Step 1.1: Root Mean Square (RMS) calculation for <math>X_0</math>
RMS formula:
<math>RMS(X_0) = \sqrt{\tfrac{1}{9}\textstyle\sum_{i=1}^{9} X_0[i]^{2}}</math>
Substitute the initial weights:
<math>RMS(X_0) \approx 0.806</math>
Step 1.2: Find the Learning Rate Scaling (<math>\alpha_t</math>):
Learning rate formula:
<math>\alpha_t = \max(\epsilon_2, RMS(X_{t-1}))\cdot\rho_t</math>
Substitute the RMS:
<math>\alpha_1 = \max(0.001, 0.806)\cdot 0.01 = 0.00806</math>
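A two-line check of Step 1 in NumPy (<code>rms_x0</code> is taken from the setup above, since the entries of <math>X_0</math> were not preserved on this page):
<syntaxhighlight lang="python">
eps2, rho1, rms_x0 = 1e-3, 1e-2, 0.806  # values from the problem setup
alpha1 = max(eps2, rms_x0) * rho1
print(round(alpha1, 5))  # 0.00806
</syntaxhighlight>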
Step 2: Compute <math>G^{2}_t</math> (Element-wise Square of the Gradient)
Square the gradient values:
<math>G^{2}_t = \begin{bmatrix} 0.3^2&(-0.2)^2&0.4^2\\ (-0.5)^2&0.6^2&(-0.1)^2\\0.2^2&(-0.4)^2 &0.3^2 \end{bmatrix} = \begin{bmatrix} 0.09& 0.04&0.16\\ 0.25&0.36&0.01\\0.04&0.16&0.09\end{bmatrix}</math>
Step 3: Find the moment estimate
Step 3.1: Compute row moments (<math>R_t</math>)
<math>R_t = \hat{\beta}_{2t}\cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot(\tfrac{1}{m}\textstyle\sum_{j=1}^{m} G^{2}_t[i,j] + \epsilon_1)</math>
This equation computes the row-wise second moments (<math>R_t</math>) as an exponential moving average of the past moments (<math>R_{t-1}</math>) and the current row-wise mean of the squared gradients (<math>G^{2}_t</math>), with the balance controlled by the decay parameter (<math>\hat{\beta}_{2t}</math>).
For <math>t = 1</math>:
Since <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>, for the first iteration <math>\hat{\beta}_{21} = 1 - 1^{-0.8} = 0</math>. And because <math>\epsilon_1 = 10^{-30}</math> is too small to matter, we ignore it. The update of <math>R_1</math> is then simply the row-wise mean:
<math>R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix}</math>
Step 3.2: Compute column moments (<math>C_t</math>)
The process is the same as for the row moments:
<math>C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} G^{2}_t[i,j] + \epsilon_1)</math>
Column-wise mean:
<math>C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix}</math>
Step 3.3: Second Moment Estimate (<math>V_t</math>)
The second moment estimate is calculated as the outer product of the row moments (<math>R_t</math>) and the column moments (<math>C_t</math>):
<math>V_1 = R_1 C_1^{\top} = \begin{bmatrix} 0.0122&0.0180&0.0084\\ 0.0262&0.0386&0.0179\\ 0.0122&0.0180&0.0084\end{bmatrix}</math>
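Steps 2 and 3 can be verified with a few lines of NumPy (a check of the arithmetic above, with our variable names):
<syntaxhighlight lang="python">
import numpy as np

G = np.array([[ 0.3, -0.2,  0.4],
              [-0.5,  0.6, -0.1],
              [ 0.2, -0.4,  0.3]])
G2 = G * G                  # Step 2: element-wise square
R1 = G2.mean(axis=1)        # Step 3.1: row means    -> [0.0967, 0.2067, 0.0967]
C1 = G2.mean(axis=0)        # Step 3.2: column means -> [0.1267, 0.1867, 0.0867]
V1 = np.outer(R1, C1)       # Step 3.3: outer product
print(np.round(V1, 4))
</syntaxhighlight>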
Step 4: Update the vector (<math>U_t</math>)
Step 4.1: Find the vector value of <math>U_t</math>
Formula of <math>U_t</math>:
<math>U_t = \frac{G_t}{\sqrt{V_t+\epsilon_1}}</math>
Substitute <math>G_t</math> and <math>V_t</math>:
<math>U_1 = \frac{\begin{bmatrix}0.3&-0.2&0.4 \\ -0.5&0.6&-0.1\\0.2&-0.4&0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&0.0180&0.0084\\ 0.0262&0.0386&0.0179\\0.0122&0.0180&0.0084 \end{bmatrix}}} = \begin{bmatrix} 2.711&-1.489&4.370\\-3.090&3.055&-0.747\\1.807&-2.978&3.278 \end{bmatrix}</math>
Step 4.2: Clipped Update Vector (<math>\hat{U}_t</math>)
Formula of <math>\hat{U}_t</math>:
<math>\hat{U}_t = \frac{U_t}{\max(1,\tfrac{RMS(U_t)}{d})}</math>
Calculate the RMS of <math>U_t</math>:
<math>RMS(U_1) = \sqrt{\tfrac{1}{9}\textstyle\sum_{i=1}^{9} U_1[i]^{2}} \approx 2.808</math>
Since <math>RMS(U_1) > d</math>, scale <math>U_1</math> by <math>\tfrac{1}{2.808}</math>:
<math>\hat{U}_1 = \frac{U_1}{2.808} \approx \begin{bmatrix} 0.965&-0.530&1.556\\-1.100&1.088&-0.266\\0.644&-1.060&1.167\end{bmatrix}</math>
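Step 4 can be checked the same way; the computed root mean square confirms the value of 2.808 used above:
<syntaxhighlight lang="python">
import numpy as np

G = np.array([[ 0.3, -0.2,  0.4],
              [-0.5,  0.6, -0.1],
              [ 0.2, -0.4,  0.3]])
G2 = G * G
V1 = np.outer(G2.mean(axis=1), G2.mean(axis=0))
U1 = G / np.sqrt(V1)                 # Step 4.1: normalized gradient
rms_u = np.sqrt(np.mean(U1 ** 2))
print(round(rms_u, 3))               # 2.808
U1_hat = U1 / max(1.0, rms_u / 1.0)  # Step 4.2: clipped update (d = 1)
print(np.round(U1_hat, 3))
</syntaxhighlight>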
Step 5: Weight Update (<math>X_t</math>)
The first iteration ends with the parameter update, using <math>\alpha_1 = 0.00806</math> from Step 1:
<math>X_1 = X_0 - \alpha_1\cdot\hat{U}_1 = X_0 - 0.00806\cdot\hat{U}_1</math>
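Finally, the update itself in NumPy. The entries of <math>X_0</math> were not preserved on this page, so the matrix below is a hypothetical placeholder for illustration only:
<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[ 0.7, -0.5,  0.9],   # hypothetical X0, for illustration only
               [-1.1,  0.8, -0.6],
               [ 0.5, -0.9,  1.2]])
alpha1 = 0.00806
U1_hat = np.array([[ 0.965, -0.530,  1.556],
                   [-1.100,  1.088, -0.266],
                   [ 0.644, -1.060,  1.167]])
X1 = X0 - alpha1 * U1_hat            # Step 5: weight update
print(np.round(X1, 4))
</syntaxhighlight>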
Applications
Conclusion
Reference
Shazeer, N., & Stern, M. (2018). Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. Proceedings of the 35th International Conference on Machine Learning (ICML 2018). arXiv:1804.04235.