Adafactor
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Problem formulation

1. Objective

Minimize the loss function <math>f(x)</math>, where <math>x \in \mathbb{R}^{n}</math> is the weight vector to be optimized.

2. Parameters

  • Gradient: <math>G_t = \nabla f_t(X_{t-1})</math>

  • Second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t}\,\hat{V}_{t-1} + (1-\hat{\beta}_{2t})\,(G_t^{2} + \epsilon_1)</math>

  • Where:
    • <math>\hat{V}_t</math> is the running average of the squared gradient.
    • <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
    • <math>\epsilon_1</math> is a regularization constant.
  • Step size: <math>\alpha_t = \max(\epsilon_2,\,\mathrm{RMS}(X_{t-1}))\cdot\rho_t</math>

  • Where:
    • <math>\rho_t</math> is the relative step size.
    • <math>\epsilon_2</math> is a regularization constant.
    • <math>\mathrm{RMS}</math> is the root mean square, defined as <math>\mathrm{RMS}(X) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} X[i]^2}</math> (see the short sketch below).
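
A minimal NumPy sketch of the RMS and step-size formulas above. The weight vector, the iteration counter, and the constants <math>\epsilon_2 = 10^{-3}</math> and <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math> are illustrative values only (the latter two are the defaults proposed later in this article).

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    """Root mean square over all entries of x."""
    return np.sqrt(np.mean(np.square(x)))

def step_size(x_prev, t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(X_{t-1})) * rho_t, with rho_t = min(1e-2, 1/sqrt(t))."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    return max(eps2, rms(x_prev)) * rho_t

x_prev = np.array([0.7, -0.5, 0.9])        # illustrative weight vector
print(rms(x_prev), step_size(x_prev, t=1))
</syntaxhighlight>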

3. Algorithms

Adafactor for Weighted Vectors

Inputs:

  • Initial point: <math>X_0 \in \mathbb{R}^{n}</math>
  • Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
  • Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
  • Regularization constants: <math>\epsilon_1, \epsilon_2</math>
  • Clipping threshold: <math>d</math>

Algorithm (a NumPy sketch follows the listing):

  • For <math>t = 1</math> to <math>T</math>:
    • Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2,\,\mathrm{RMS}(X_{t-1}))\cdot\rho_t</math>
    • Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
    • Update second moment estimate: <math>\hat{V}_t = \hat{\beta}_{2t}\,\hat{V}_{t-1} + (1-\hat{\beta}_{2t})\,(G_t^{2} + \epsilon_1)</math>
    • Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
    • Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max\!\left(1,\,\frac{\mathrm{RMS}(U_t)}{d}\right)}</math>
    • Update parameter: <math>X_t = X_{t-1} - \alpha_t\,\hat{U}_t</math>
  • End for
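
The listing above can be mapped almost line-by-line onto a short NumPy routine. This is only a sketch: the quadratic loss used for the gradient (and hence `grad_fn`) is a placeholder, and the hyperparameter values are the defaults proposed in the next section.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def adafactor_vector(x0, grad_fn, T=100, eps1=1e-30, eps2=1e-3, d=1.0):
    """Adafactor for a weight vector, following the listing above."""
    x = x0.astype(float).copy()
    v_hat = np.zeros_like(x)                      # second moment estimate
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size (proposed default)
        beta2_t = 1.0 - t ** (-0.8)               # corrected decay, so beta2_1 = 0
        alpha_t = max(eps2, rms(x)) * rho_t       # adaptive step size
        g = grad_fn(x)                            # gradient G_t
        v_hat = beta2_t * v_hat + (1 - beta2_t) * (g ** 2 + eps1)
        u = g / np.sqrt(v_hat)                    # normalized gradient U_t
        u_hat = u / max(1.0, rms(u) / d)          # clipped update
        x = x - alpha_t * u_hat                   # parameter update
    return x

# Placeholder objective: f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x_final = adafactor_vector(np.array([0.7, -0.5, 0.9]), grad_fn=lambda x: x)
print(x_final)
</syntaxhighlight>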

Adafactor for Weighted Matrices

Inputs:

  • Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
  • Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
  • Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
  • Regularization constants: <math>\epsilon_1, \epsilon_2</math>
  • Clipping threshold: <math>d</math>

Algorithm:

  • For <math>t = 1</math> to <math>T</math>:
    • Compute adaptive step size: <math>\alpha_t = \max(\epsilon_2,\,\mathrm{RMS}(X_{t-1}))\cdot\rho_t</math>
    • Compute gradient: <math>G_t = \nabla f_t(X_{t-1})</math>
    • Update row-wise second moment: <math>R_t = \hat{\beta}_{2t}\,R_{t-1} + (1-\hat{\beta}_{2t})\left(\tfrac{1}{m}\sum_{j=1}^{m} G_t^{2}[i,j] + \epsilon_1\right)</math>
    • Update column-wise second moment: <math>C_t = \hat{\beta}_{2t}\,C_{t-1} + (1-\hat{\beta}_{2t})\left(\tfrac{1}{n}\sum_{i=1}^{n} G_t^{2}[i,j] + \epsilon_1\right)</math>
    • Update overall second moment estimate: <math>\hat{V}_t = R_t\,C_t</math> (outer product of the row and column moments)
    • Compute normalized gradient: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
    • Apply clipping: <math>\hat{U}_t = \frac{U_t}{\max\!\left(1,\,\frac{\mathrm{RMS}(U_t)}{d}\right)}</math>
    • Update parameter: <math>X_t = X_{t-1} - \alpha_t\,\hat{U}_t</math>
  • End for

4. Proposed Hyperparameters for Adafactor

  • Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
  • Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
  • Clipping threshold: <math>d = 1</math>
  • Relative step size: <math>\rho_t = \min(10^{-2},\,\tfrac{1}{\sqrt{t}})</math>
  • Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>

A NumPy sketch of one update step of the matrix variant with these defaults is given below.
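
This sketch implements a single update step of the matrix variant with the proposed defaults. The caller supplies the gradient `g`; the row and column moments are combined with a plain outer product, matching the description used in the numerical example below, so it should be read as an illustration rather than a definitive implementation.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def adafactor_matrix_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor step for a weight matrix x given its gradient g.

    r and c are the row-wise and column-wise second-moment vectors carried
    over from the previous iteration (zeros at t = 1).
    """
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
    beta2_t = 1.0 - t ** (-0.8)                          # second moment decay
    alpha_t = max(eps2, rms(x)) * rho_t                  # adaptive step size

    g2 = g ** 2 + eps1
    r = beta2_t * r + (1 - beta2_t) * g2.mean(axis=1)    # row moments R_t
    c = beta2_t * c + (1 - beta2_t) * g2.mean(axis=0)    # column moments C_t
    v_hat = np.outer(r, c)                               # second moment estimate
    u = g / np.sqrt(v_hat)                               # normalized gradient U_t
    u_hat = u / max(1.0, rms(u) / d)                     # clipped update
    return x - alpha_t * u_hat, r, c

# First iteration on the example data used in the next section (unrounded values).
X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
X1, R1, C1 = adafactor_matrix_step(X0, G1, np.zeros(3), np.zeros(3), t=1)
print(X1)
</syntaxhighlight>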

Numerical Examples

Step-by-step instructions for determining the result of the first iteration.

Problem setup

Initial weights (<math>X_0</math>):

<math>X_0 = \begin{bmatrix} 0.7 &-0.5& 0.9\\ -1.1 & 0.8& -1.6\\1.2&-0.7& 0.4 \end{bmatrix}</math>

Gradient for the first iteration (<math>G_1</math>), i.e. the gradient of the loss function with respect to <math>X</math>:

<math>G_1 = \begin{bmatrix} 0.3&-0.2&0.4\\ -0.5&0.6&-0.1\\0.2&-0.4 &0.3 \end{bmatrix}</math>
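
For use in the short checks that follow, the setup can be written as two NumPy arrays:

<syntaxhighlight lang="python">
import numpy as np

# Initial weight matrix X0 and the first-iteration gradient G1 from the problem setup.
X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
print(X0.shape, G1.shape)                 # both are 3 x 3
</syntaxhighlight>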

Hyperparameters setup

<math>\epsilon_2</math> (Minimum learning rate scaling factor)

<math>\epsilon_1</math> (Regularization constant)

<math>d</math> (Clipping threshold)

<math>\rho_t</math> (Relative step size)

<math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math> (Second moment decay)

Step 1: Learning Rate Scaling

Define the relative step size <math>\rho_t</math>.

Step 1.1: Root Mean Square (RMS) calculation for <math>X_0</math>

RMS formula:

<math>RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n X_0[i]^2}</math>

Substitute the initial weights into the RMS formula.

Step 1.2: Find the Learning Rate Scaling (<math>\alpha_t</math>):

Learning rate formula:

<math>\alpha_t = \max(\epsilon_2,\,RMS(X_{t-1}))\cdot\rho_t</math>

Substitute the RMS value into the formula to obtain <math>\alpha_1</math>; the worked example below uses <math>\alpha_1 = 0.00806</math>.
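
A quick check of Step 1 in NumPy. The values of <math>\rho_1</math> and <math>\epsilon_2</math> are not restated in this example, so the proposed defaults are assumed here.

<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])

rms_x0 = np.sqrt(np.mean(X0 ** 2))        # RMS over all nine entries
rho_1 = min(1e-2, 1.0 / np.sqrt(1))       # assumed relative step size for t = 1
alpha_1 = max(1e-3, rms_x0) * rho_1       # assumed eps2 = 1e-3
print(rms_x0, alpha_1)
</syntaxhighlight>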

Step 2: Compute <math>G^{2}_1</math> (Element-wise Square of Gradient)

Compute the squared value of each element in the gradient matrix <math>G_1</math>.

<math>G^{2}_1 = \begin{bmatrix} 0.3^2&(-0.2)^2&0.4^2\\ (-0.5)^2&0.6^2&(-0.1)^2\\0.2^2&(-0.4)^2 &0.3^2 \end{bmatrix} = \begin{bmatrix} 0.09& 0.04&0.16\\ 0.25&0.36&0.01\\0.04&0.16&0.09\end{bmatrix}</math>


Step 3: Find the moment estimate

Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.

Step 3.1: Compute row moments (<math>R_t</math>)

<math>R_t = \hat{\beta}_{2t}\cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot\left(\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1\right)</math>

This equation computes the row-wise second moments (<math>R_t</math>) as an exponential moving average of past moments (<math>R_{t-1}</math>) and the current row-wise mean of squared gradients (<math>\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]</math>), with a balance controlled by <math>\hat{\beta}_{2t}</math>.

For <math>t = 1</math>:

Since <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>, for the first iteration <math>\hat{\beta}_{21} = 0</math>. And because <math>\epsilon_1</math> is very small, we can ignore it. The update of <math>R_1</math> is:

<math>R_{1} = \tfrac{1}{m}\sum_{j=1}^m G^{2}_1[i,j]</math>

Row-wise mean (<math>R_1</math>):

<math>R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix}</math>

Step 3.2: Compute column moments (<math>C_t</math>)

The process is the same as for the row moments.

<math>C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot\left(\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1\right)</math>

Column-wise mean (<math>C_1</math>):

<math>C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} & \tfrac{0.04+0.36+0.16}{3} & \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267 & 0.1867 & 0.0867 \end{bmatrix}</math>
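
The row-wise and column-wise means in Steps 2–3 can be checked directly:

<syntaxhighlight lang="python">
import numpy as np

G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])

G1_sq = G1 ** 2               # Step 2: element-wise square of the gradient
R1 = G1_sq.mean(axis=1)       # Step 3.1: row means    -> [0.0967, 0.2067, 0.0967]
C1 = G1_sq.mean(axis=0)       # Step 3.2: column means -> [0.1267, 0.1867, 0.0867]
print(R1, C1)
</syntaxhighlight>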


Step 3.3: Second Moment Estimate (<math>\hat{V}_t</math>)

The Second Moment Estimate is calculated as the outer product of the row moments (<math>R_t</math>) and column moments (<math>C_t</math>): <math>\hat{V}_t = R_t\,C_t</math>.



Step 4: Update the vector (<math>U_t</math>)

<math>U_t</math> is computed by scaling the gradient matrix <math>G_t</math> element-wise with the inverse square root of the second moment estimate (<math>\hat{V}_t</math>).

Step 4.1: Find the vector value of <math>U_t</math>

Formula of <math>U_t</math>:

<math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>

Substitute <math>G_1</math> and <math>\hat{V}_1</math>.
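
The intermediate values of <math>\hat{V}_1</math> and <math>U_1</math> are not displayed in this revision, so the sketch below only illustrates the two operations involved (outer product, then element-wise scaling); it is not a reproduction of the original intermediate numbers.

<syntaxhighlight lang="python">
import numpy as np

G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
R1 = (G1 ** 2).mean(axis=1)   # row moments from Step 3.1
C1 = (G1 ** 2).mean(axis=0)   # column moments from Step 3.2

V1 = np.outer(R1, C1)         # Step 3.3: outer product of row and column moments
U1 = G1 / np.sqrt(V1)         # Step 4.1: element-wise normalized gradient
print(U1)
</syntaxhighlight>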



Step 4.2: Clipped Update Vector (<math>\hat{U}_t</math>)

Scale the update vector (<math>U_t</math>) to ensure its RMS value does not exceed a predefined clipping threshold (<math>d</math>), maintaining stability in updates.

Formula of <math>\hat{U}_t</math>:

<math>\hat{U}_t = \frac{U_t}{\max\left(1,\tfrac{RMS(U_t)}{d}\right)}</math>

Compute RMS of <math>U_1</math>. Since <math>RMS(U_1) > d</math>, scale <math>U_1</math> by <math>\tfrac{1}{3.303}</math>:

<math>\hat{U}_1 = \begin{bmatrix} 0.965&-0.53&1.556 \\-1.1&1.088&-0.266\\0.664&-1.06&1.167 \end{bmatrix}</math>
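
A sketch of the clipping rule. The unclipped <math>U_1</math> is not shown in this revision, so it is reconstructed here from the clipped matrix and the stated scale factor, assuming the proposed threshold <math>d = 1</math>.

<syntaxhighlight lang="python">
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

U1_hat_given = np.array([[0.965, -0.53, 1.556],
                         [-1.1, 1.088, -0.266],
                         [0.664, -1.06, 1.167]])
U1 = 3.303 * U1_hat_given              # assumed reconstruction of the unclipped update

d = 1.0                                # assumed clipping threshold
U1_hat = U1 / max(1.0, rms(U1) / d)    # scale so that the RMS of the update is at most d
print(rms(U1), U1_hat)
</syntaxhighlight>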

Step 5: Weight Update (<math>X_1</math>)

Adjust the weights (<math>X_t</math>) by subtracting the product of the learning rate (<math>\alpha_t</math>) and the clipped update vector (<math>\hat{U}_t</math>):

<math>X_1 = X_0 - \alpha_1\cdot\hat{U}_1</math>

The result for the first iteration:

<math>X_1 = \begin{bmatrix} 0.7 &-0.5& 0.9\\ -1.1 & 0.8& -1.6\\1.2&-0.7& 0.4 \end{bmatrix} - 0.00806 \cdot \begin{bmatrix} 0.965&-0.53&1.556 \\-1.1&1.088&-0.266\\0.664&-1.06&1.167 \end{bmatrix}</math>

<math>X_1 = \begin{bmatrix} 0.692&-0.496&0.887 \\-1.091&0.791&-1.598\\ 1.195&-0.691&0.391\end{bmatrix}</math>
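
The final subtraction can be verified directly from the learning rate <math>\alpha_1 = 0.00806</math> and the clipped update <math>\hat{U}_1</math> given above:

<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
U1_hat = np.array([[0.965, -0.53, 1.556],
                   [-1.1, 1.088, -0.266],
                   [0.664, -1.06, 1.167]])
alpha_1 = 0.00806                       # learning rate used in this example

X1 = X0 - alpha_1 * U1_hat              # Step 5: weight update
print(np.round(X1, 3))
</syntaxhighlight>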




Applications

Conclusion

Reference