<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://optimization.cbe.cornell.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Fall2024+Wiki+Team6</id>
	<title>Cornell University Computational Optimization Open Textbook - Optimization Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://optimization.cbe.cornell.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Fall2024+Wiki+Team6"/>
	<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Special:Contributions/Fall2024_Wiki_Team6"/>
	<updated>2026-04-26T06:55:56Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.0</generator>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7705</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7705"/>
		<updated>2024-12-16T02:46:28Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square of the unnormalized update &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weight Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
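The vector recipe above can be sketched in plain Python (a minimal, illustrative implementation with hypothetical helper names, exercised on a toy quadratic; not a reference implementation):&lt;br /&gt;

```python
import math

def rms(v):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def adafactor_vector_step(x, grad, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector, following the steps above."""
    rho_t = min(1e-2, 1.0 / math.sqrt(t))        # relative step size
    alpha_t = max(eps2, rms(x)) * rho_t          # adaptive step size
    beta2t = 1.0 - t ** (-0.8)                   # second-moment decay (0 at t = 1)
    v = [beta2t * vi + (1.0 - beta2t) * (g * g + eps1) for vi, g in zip(v, grad)]
    u = [g / math.sqrt(vi) for g, vi in zip(grad, v)]
    scale = max(1.0, rms(u) / d)                 # clip update to RMS threshold d
    x = [xi - alpha_t * ui / scale for xi, ui in zip(x, u)]
    return x, v

# Toy quadratic f(x) = 0.5 * sum((x_i - c_i)^2), with gradient x - c
c = [1.0, -2.0, 0.5]
x = [2.0, 1.0, -1.0]
v = [0.0, 0.0, 0.0]
for t in range(1, 501):
    grad = [xi - ci for xi, ci in zip(x, c)]
    x, v = adafactor_vector_step(x, grad, v, t)
loss = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
```

After a few hundred steps the iterate approaches the minimizer c even though no global learning rate was tuned, which is the point of the relative step size.&lt;br /&gt;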
&lt;br /&gt;
==== Adafactor for Weight Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
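The factored variant can be sketched the same way (again an illustrative sketch: only the row statistics R and column statistics C are stored, and the second moment estimate is formed element-wise on the fly):&lt;br /&gt;

```python
import math

def rms_mat(M):
    """Root mean square over all entries of a matrix (list of rows)."""
    n, m = len(M), len(M[0])
    return math.sqrt(sum(x * x for row in M for x in row) / (n * m))

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update for a weight matrix."""
    n, m = len(X), len(X[0])
    rho_t = min(1e-2, 1.0 / math.sqrt(t))
    alpha_t = max(eps2, rms_mat(X)) * rho_t
    beta2t = 1.0 - t ** (-0.8)
    # Row sums and column sums of (G^2 + eps1): O(n + m) state, not O(n * m)
    row = [sum(g * g + eps1 for g in G[i]) for i in range(n)]
    col = [sum(G[i][j] ** 2 + eps1 for i in range(n)) for j in range(m)]
    R = [beta2t * R[i] + (1.0 - beta2t) * row[i] for i in range(n)]
    C = [beta2t * C[j] + (1.0 - beta2t) * col[j] for j in range(m)]
    denom = sum(R)  # 1_n^T R_t
    # V_hat[i][j] = R[i] * C[j] / denom is used element-wise, never stored fully
    U = [[G[i][j] / math.sqrt(R[i] * C[j] / denom) for j in range(m)]
         for i in range(n)]
    scale = max(1.0, rms_mat(U) / d)
    X = [[X[i][j] - alpha_t * U[i][j] / scale for j in range(m)] for i in range(n)]
    return X, R, C

# Quick check on f(X) = 0.5 * sum((X - T)^2), whose gradient is X - T
T = [[0.4, -0.3, 0.5], [-0.6, 0.2, -1.5], [1.0, -0.3, 0.1]]
X = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
R, C = [0.0] * 3, [0.0] * 3
for t in range(1, 301):
    G = [[X[i][j] - T[i][j] for j in range(3)] for i in range(3)]
    X, R, C = adafactor_matrix_step(X, G, R, C, t)
loss = 0.5 * sum((X[i][j] - T[i][j]) ** 2 for i in range(3) for j in range(3))
```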
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
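As a quick illustration, the two schedules can be tabulated; with these defaults the &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt; cap on the relative step size binds for the first &amp;lt;math&amp;gt;10^4&amp;lt;/math&amp;gt; steps:&lt;br /&gt;

```python
import math

# The two proposed schedules as functions of the step count t
def rho(t):
    # Relative step size: 1/sqrt(t), capped at 1e-2 (the cap binds until t = 1e4)
    return min(1e-2, 1.0 / math.sqrt(t))

def beta2(t):
    # Second-moment decay: 0 at t = 1, approaching 1 as training proceeds
    return 1.0 - t ** (-0.8)

schedule = [(t, rho(t), beta2(t)) for t in (1, 10, 100, 10**4, 10**6)]
```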
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
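To make the savings concrete, here is a quick count of second-moment state for a single weight matrix (the layer size is a hypothetical example):&lt;br /&gt;

```python
# Elements of second-moment state kept for one n x m weight matrix
def adam_state(n, m):
    return n * m              # Adam stores the full matrix V (plus a first moment)

def adafactor_state(n, m):
    return n + m              # Adafactor stores row statistics R and column statistics C

n = m = 4096                  # e.g. one large Transformer projection matrix
full = adam_state(n, m)
factored = adafactor_state(n, m)
ratio = full // factored      # 2048-fold reduction for this layer
```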
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
=== Problem setup ===&lt;br /&gt;
&#039;&#039;&#039;Minimize the loss function:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;f(X) = \frac{1}{2}\sum_{i,j}(X_{ij}-C_{ij})^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Target matrix (&amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C = \begin{bmatrix} 0.4 &amp;amp; -0.3 &amp;amp;0.5 \\ -0.6 &amp;amp; 0.2&amp;amp;-1.5\\1.0&amp;amp;-0.3&amp;amp;0.1 \end{bmatrix}&amp;lt;/math&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Hyperparameters setup ===&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
=== Step 1:  Learning Rate Scaling ===&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
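Step 1 can be verified directly from the entries of &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt; with a few lines of Python:&lt;br /&gt;

```python
import math

# Recompute RMS(X_0) and the step size alpha_1 from the initial weights
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
eps2 = 1e-3
rho1 = min(1e-2, 1.0 / math.sqrt(1))   # relative step size at t = 1

rms_x0 = math.sqrt(sum(x * x for row in X0 for x in row) / 9)
alpha1 = max(eps2, rms_x0) * rho1
```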
&lt;br /&gt;
=== Step 2: Compute &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; (Element-wise Square of Gradient) ===&lt;br /&gt;
&#039;&#039;&#039;Step 2.1: Compute the gradient of the loss function&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t =     {\partial f(X)\over\partial X} = X_{t-1} - C&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Subtract C from &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix} - \begin{bmatrix} 0.4 &amp;amp; -0.3 &amp;amp;0.5 \\ -0.6 &amp;amp; 0.2&amp;amp;-1.5\\1.0&amp;amp;-0.3&amp;amp;0.1 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2.2: Compute the squared value of each element in the gradient matrix &amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
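In code, the gradient is an element-wise subtraction and the squaring is element-wise as well:&lt;br /&gt;

```python
# Gradient of f(X) = 0.5 * sum((X - C)^2) is X - C, then square element-wise
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
C = [[0.4, -0.3, 0.5], [-0.6, 0.2, -1.5], [1.0, -0.3, 0.1]]

G1 = [[X0[i][j] - C[i][j] for j in range(3)] for i in range(3)]
G1_sq = [[g * g for g in row] for row in G1]
```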
&lt;br /&gt;
=== Step 3: Find the moment estimate ===&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, and because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is extremely small it can be neglected here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
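Both sets of moments are simply row and column means of the squared gradient (a short check, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; neglected):&lt;br /&gt;

```python
# Row-wise and column-wise means of the squared gradient from Step 2
G1_sq = [[0.09, 0.04, 0.16], [0.25, 0.36, 0.01], [0.04, 0.16, 0.09]]

R1 = [sum(row) / 3 for row in G1_sq]                              # row moments
C1 = [sum(G1_sq[i][j] for i in range(3)) / 3 for j in range(3)]   # column moments
```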
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Following the factored update rule, the second moment estimate is the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​), normalized by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;, the sum of the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;math&amp;gt;1_n^T R_1 = 0.0967+0.2067+0.0967 \approx 0.4&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1 = \frac{1}{0.4}\begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1 = \begin{bmatrix} 0.0306&amp;amp;0.0451&amp;amp;0.0210\\ 0.0655&amp;amp;0.0965&amp;amp;0.0448\\ 0.0306&amp;amp;0.0451&amp;amp;0.0210\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Update the vector ===&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the update vector &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0306&amp;amp;0.0451&amp;amp;0.0210\\ 0.0655&amp;amp;0.0965&amp;amp;0.0448\\0.0306&amp;amp;0.0451&amp;amp;0.0210 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 1.714&amp;amp;-0.941&amp;amp;2.763\\-1.954&amp;amp;1.932&amp;amp;-0.472\\1.143&amp;amp;-1.883&amp;amp;2.072 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{\max(1,\tfrac{RMS(U_t)}{d})} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9} \sum_{i=1}^9 U_1[i]^2} \approx 1.776 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{1.776} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} = \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Weight Update ===&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix} - 0.00946 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
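As a sanity check, the whole first iteration can be replayed in a short script (a minimal sketch of the steps above, not a reference implementation):&lt;br /&gt;

```python
import math

# Replay the first Adafactor iteration of the worked example
X = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
T = [[0.4, -0.3, 0.5], [-0.6, 0.2, -1.5], [1.0, -0.3, 0.1]]

def loss(M):
    return 0.5 * sum((M[i][j] - T[i][j]) ** 2 for i in range(3) for j in range(3))

# Step 1: adaptive step size
rms_x = math.sqrt(sum(x * x for row in X for x in row) / 9)
alpha = max(1e-3, rms_x) * min(1e-2, 1.0 / math.sqrt(1))

# Step 2: gradient and its element-wise square
G = [[X[i][j] - T[i][j] for j in range(3)] for i in range(3)]
G2 = [[g * g for g in row] for row in G]

# Step 3: factored second moment (beta2_1 = 0, eps1 negligible)
R = [sum(row) / 3 for row in G2]
Cm = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]
denom = sum(R)
V = [[R[i] * Cm[j] / denom for j in range(3)] for i in range(3)]

# Step 4: normalized update, then clipping at d = 1
U = [[G[i][j] / math.sqrt(V[i][j]) for j in range(3)] for i in range(3)]
rms_u = math.sqrt(sum(u * u for row in U for u in row) / 9)
scale = max(1.0, rms_u)

# Step 5: weight update
X1 = [[X[i][j] - alpha * U[i][j] / scale for j in range(3)] for i in range(3)]
```

One step already lowers the loss, and because the update is clipped to unit RMS, the result is insensitive to any uniform rescaling of the second-moment estimate.&lt;br /&gt;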
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, it trained models that achieved accuracy and F1 scores of 99%, and training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 scores for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Software Tools and Platforms ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; Provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Future Prospects ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== Reference ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot; ''arXiv preprint''.&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7704</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7704"/>
		<updated>2024-12-16T02:43:31Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Step 2: Compute                                    G                        t                                   2                                     {\displaystyle G_{t}^{2}}    ​ (Element-wise Square of Gradient) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square of the unscaled update &amp;lt;math&amp;gt;u_{xt}&amp;lt;/math&amp;gt;, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
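The loop above can be sketched in Python with NumPy (an illustrative sketch of the pseudocode, not a production implementation; the function name and the loss_grad callback are assumptions for this example):&lt;br /&gt;

```python
import numpy as np

def adafactor_vector(x0, loss_grad, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight vector, following the pseudocode.

    loss_grad(x) must return the gradient of the loss at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                              # second-moment estimate
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))             # relative step size
        beta2 = 1.0 - t ** (-0.8)                     # decay; equals 0 at t = 1
        alpha = max(eps2, np.sqrt(np.mean(x ** 2))) * rho
        g = loss_grad(x)
        v = beta2 * v + (1.0 - beta2) * (g ** 2 + eps1)
        u = g / np.sqrt(v)                            # normalized gradient
        u /= max(1.0, np.sqrt(np.mean(u ** 2)) / d)   # RMS clipping
        x -= alpha * u
    return x
```

For example, with the quadratic loss used in the numerical example below, loss_grad = lambda x: x - c moves x toward the target c.&lt;br /&gt;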
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
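One iteration of the matrix variant can be sketched as follows (illustrative NumPy code; the function name and argument layout are assumptions, and the matrix V is materialized here only for clarity):&lt;br /&gt;

```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One step of the matrix Adafactor pseudocode above (sketch)."""
    rho = min(1e-2, 1.0 / np.sqrt(t))
    beta2 = 1.0 - t ** (-0.8)
    alpha = max(eps2, np.sqrt(np.mean(X ** 2))) * rho
    G2 = G ** 2 + eps1
    R = beta2 * R + (1.0 - beta2) * G2.sum(axis=1)  # row statistics, shape (n,)
    C = beta2 * C + (1.0 - beta2) * G2.sum(axis=0)  # column statistics, shape (m,)
    V = np.outer(R, C) / R.sum()                    # factored second moment
    U = G / np.sqrt(V)
    U /= max(1.0, np.sqrt(np.mean(U ** 2)) / d)     # RMS clipping
    return X - alpha * U, R, C
```

Only the length-n vector R and the length-m vector C persist between iterations; an implementation can apply V element-wise without ever storing it.&lt;br /&gt;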
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
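Both schedules can be written down directly (a small Python sketch of the two formulas above):&lt;br /&gt;

```python
import math

def rho(t):
    """Relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / math.sqrt(t))

def beta2hat(t):
    """Second-moment decay: 1 - t**(-0.8)."""
    return 1.0 - t ** (-0.8)

# At t = 1 the decay is 0, so the first second-moment estimate is just the
# first squared gradient; rho is already capped at 1e-2 from the start.
```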
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, instead of the full second-moment matrix&lt;br /&gt;
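To make the saving concrete, a quick count for an arbitrary example shape (hypothetical numbers, not from the source):&lt;br /&gt;

```python
# Storage for the second moment of an n-by-m weight matrix (example shape).
n, m = 4096, 1024
full_second_moment = n * m   # entries a full (Adam-style) estimate stores
factored = n + m             # entries Adafactor stores (row and column stats)
print(full_second_moment, factored)  # prints 4194304 5120
```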
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
=== Problem setup ===&lt;br /&gt;
&#039;&#039;&#039;Minimize the loss function:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;f(X) = \frac{1}{2}\sum_{i,j}(X_{ij}-C_{ij})^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Target matrix (&amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt;)：&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C = \begin{bmatrix} 0.4 &amp;amp; -0.3 &amp;amp;0.5 \\ -0.6 &amp;amp; 0.2&amp;amp;-1.5\\1.0&amp;amp;-0.3&amp;amp;0.1 \end{bmatrix}&amp;lt;/math&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Hyperparameters setup ===&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
=== Step 1:  Learning Rate Scaling ===&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
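As a quick numerical check of Steps 1.1 and 1.2 (a NumPy sketch, with the initial weights from the setup above):&lt;br /&gt;

```python
import numpy as np

X0 = np.array([[ 0.7, -0.5,  0.9],
               [-1.1,  0.8, -1.6],
               [ 1.2, -0.7,  0.4]])
rho1 = min(1e-2, 1.0 / np.sqrt(1))   # relative step size at t = 1: 0.01
rms = np.sqrt(np.mean(X0 ** 2))      # RMS of the initial weights
alpha1 = max(1e-3, rms) * rho1       # learning rate scaling
print(round(rms, 3), round(alpha1, 5))  # prints 0.946 0.00946
```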
&lt;br /&gt;
=== Step 2: Compute the element-wise (Square of Gradient) ===&lt;br /&gt;
&#039;&#039;&#039;Step 2.1: Compute the gradient of the loss function&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t =     {\partial f(X)\over\partial X} = X_{t-1} - C&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Subtract C from &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix} - \begin{bmatrix} 0.4 &amp;amp; -0.3 &amp;amp;0.5 \\ -0.6 &amp;amp; 0.2&amp;amp;-1.5\\1.0&amp;amp;-0.3&amp;amp;0.1 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2.2: Compute the squared value of each element in the gradient matrix &amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Find the moment estimate ===&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it is ignored below. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Update the vector ===&lt;br /&gt;
The update is computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; element-wise by the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Weight Update ===&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U_1}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
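The five steps can be reproduced end-to-end with a short NumPy script (a sketch that follows the formulas of this example, computing the second-moment estimate as the outer product of row- and column-wise means):&lt;br /&gt;

```python
import numpy as np

# Full first iteration of the worked example (values from the setup above).
X0 = np.array([[ 0.7, -0.5,  0.9],
               [-1.1,  0.8, -1.6],
               [ 1.2, -0.7,  0.4]])
C  = np.array([[ 0.4, -0.3,  0.5],
               [-0.6,  0.2, -1.5],
               [ 1.0, -0.3,  0.1]])

alpha1 = max(1e-3, np.sqrt(np.mean(X0 ** 2))) * 1e-2  # Step 1: learning rate
G  = X0 - C                                           # Step 2: gradient
G2 = G ** 2
R  = G2.mean(axis=1, keepdims=True)      # Step 3: row-wise means
Cm = G2.mean(axis=0, keepdims=True)      #         column-wise means
V  = R * Cm                              #         outer-product estimate
U  = G / np.sqrt(V)                      # Step 4: normalized gradient
U /= max(1.0, np.sqrt(np.mean(U ** 2)))  #         clip with threshold d = 1
X1 = X0 - alpha1 * U                     # Step 5: weight update
print(np.round(X1, 3))
```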
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training time was faster than with Adam, completing in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 scores for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7686</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7686"/>
		<updated>2024-12-16T02:02:27Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Problem setup */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
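For instance, the RMS of a small vector can be computed directly. The sketch below is a minimal illustration with hypothetical values, not taken from the algorithm itself:&lt;br /&gt;

```python
import numpy as np

# RMS(u) = sqrt(mean(u**2)), the root-mean-square quantity used both
# for the step-size scale and for update clipping.
def rms(u):
    return float(np.sqrt(np.mean(np.square(u))))

# Example with hypothetical values.
u = np.array([0.3, -0.2, 0.4])
print(rms(u))  # about 0.311
```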
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
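The loop above can be sketched in Python as follows. This is a minimal illustration of the update rule using the proposed hyperparameters, not a production implementation; the gradient callback and the quadratic example at the end are hypothetical:&lt;br /&gt;

```python
import numpy as np

def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))

def adafactor_vector(x0, grad_fn, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Adafactor update loop for a weight vector (no first moment)."""
    x = x0.astype(float).copy()
    v = np.zeros_like(x)                           # running second-moment estimate
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))          # relative step size
        alpha = max(eps2, rms(x)) * rho            # adaptive step size
        g = grad_fn(x)
        beta2 = 1.0 - t ** (-0.8)                  # decay; equals 0 at t = 1
        v = beta2 * v + (1.0 - beta2) * (g * g + eps1)
        u = g / np.sqrt(v)                         # normalized gradient
        u_hat = u / max(1.0, rms(u) / d)           # RMS clipping
        x = x - alpha * u_hat
    return x

# Hypothetical quadratic loss f(x) = 0.5 * sum((x - c)**2) with gradient x - c.
c = np.array([0.4, -0.3, 0.5])
x = adafactor_vector(np.array([0.7, -0.5, 0.9]), lambda x: x - c, T=200)
```

Because &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, the first iteration uses only the current squared gradient, so no bias correction is required.&lt;br /&gt;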
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
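The matrix variant differs only in how the second moment is stored. One factored update step can be sketched as follows; the function name and signature are ours, and the defaults follow the proposed hyperparameters below:&lt;br /&gt;

```python
import numpy as np

def factored_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor step for a weight matrix x with gradient g.

    r and c are the running row and column second-moment accumulators
    (vectors of length n and m). Only these vectors are stored, never
    the full n-by-m second-moment matrix.
    """
    beta2 = 1.0 - t ** (-0.8)
    g2 = g * g + eps1
    r = beta2 * r + (1.0 - beta2) * g2.sum(axis=1)   # row sums, length n
    c = beta2 * c + (1.0 - beta2) * g2.sum(axis=0)   # column sums, length m
    v = np.outer(r, c) / r.sum()                     # factored estimate R C / (1'R)
    u = g / np.sqrt(v)                               # normalized gradient
    rms_u = float(np.sqrt(np.mean(np.square(u))))
    u_hat = u / max(1.0, rms_u / d)                  # RMS clipping
    rho = min(1e-2, 1.0 / np.sqrt(t))
    rms_x = float(np.sqrt(np.mean(np.square(x))))
    alpha = max(eps2, rms_x) * rho
    return x - alpha * u_hat, r, c
```

Note that np.outer is formed here only for clarity; a memory-tight implementation folds the two factors into the normalized gradient without ever materializing v.&lt;br /&gt;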
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of unusually large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor accumulates only its row-wise and column-wise sums, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** the division is applied element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix, and&lt;br /&gt;
** only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix.&lt;br /&gt;
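The saving can be made concrete with a size comparison. The matrix dimensions below are hypothetical, chosen only to be representative of a large embedding layer:&lt;br /&gt;

```python
# Memory footprint of the second-moment statistics for one n-by-m weight
# matrix, assuming 4-byte floats (hypothetical sizes for illustration).
n, m = 32000, 4096          # e.g. an embedding-sized weight matrix
full = n * m * 4            # Adam: full second-moment matrix, O(n*m)
factored = (n + m) * 4      # Adafactor: one row and one column vector, O(n+m)
print(full // factored)     # the factored form is thousands of times smaller
```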
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
=== Problem setup ===&lt;br /&gt;
&#039;&#039;&#039;Minimize the loss function:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;f(X) = \frac{1}{2}\sum_{i,j}(X_{ij}-C_{ij})^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Target matrix (&amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C = \begin{bmatrix} 0.4 &amp;amp; -0.3 &amp;amp;0.5 \\ -0.6 &amp;amp; 0.2&amp;amp;-1.5\\1.0&amp;amp;-0.3&amp;amp;0.1 \end{bmatrix}&amp;lt;/math&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Hyperparameters setup ===&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Second-moment regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
=== Step 1:  Learning Rate Scaling ===&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient) ===&lt;br /&gt;
&#039;&#039;&#039;Step 2.1: Compute the gradient of the loss function&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t =     {\partial f(X)\over\partial X} = X_{t-1} - C&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Subtract C from &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix} - \begin{bmatrix} 0.4 &amp;amp; -0.3 &amp;amp;0.5 \\ -0.6 &amp;amp; 0.2&amp;amp;-1.5\\1.0&amp;amp;-0.3&amp;amp;0.1 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2.2: Compute the squared value of each element in the gradient matrix &amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Find the moment estimate ===&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, but averaging over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​). Since &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored here as means rather than sums, this differs from the factored estimate in the algorithm section only by a constant scale factor, which the RMS clipping in Step 4 cancels in this example.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Weight Update (&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot      \hat{U_1}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
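The first iteration can also be reproduced numerically. The sketch below follows the mean-based row and column statistics used in this example; variable names are ours, and small rounding differences from the hand calculation are expected:&lt;br /&gt;

```python
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
C = np.array([[0.4, -0.3, 0.5], [-0.6, 0.2, -1.5], [1.0, -0.3, 0.1]])

def rms(a):
    return float(np.sqrt(np.mean(np.square(a))))

G = X0 - C                          # gradient of 0.5 * sum((X - C)**2)
R = np.mean(G * G, axis=1)          # row-wise means (t = 1, so beta2 = 0)
Ccol = np.mean(G * G, axis=0)       # column-wise means
V = np.outer(R, Ccol)               # second-moment estimate of this example
U = G / np.sqrt(V)                  # normalized gradient
U_hat = U / max(1.0, rms(U) / 1.0)  # clipping with threshold d = 1
alpha = max(1e-3, rms(X0)) * 1e-2   # step size with rho_1 = 1e-2
X1 = X0 - alpha * U_hat             # first-iteration weights
```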
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, particularly in accuracy, recall, and F1 score for humor detection.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7462</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7462"/>
		<updated>2024-12-15T03:12:42Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weight Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
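The per-step update above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation; the name adafactor_vector_step and the callback f_grad (a stand-in returning the gradient of the loss at x) are hypothetical.

```python
import math

# Sketch of one Adafactor step for a weight vector x, following the
# algorithm above. v is the running second-moment estimate, t the step.
def adafactor_vector_step(x, v, t, f_grad, d=1.0, eps1=1e-30, eps2=1e-3):
    rho = min(1e-2, 1.0 / math.sqrt(t))            # relative step size
    rms_x = math.sqrt(sum(xi * xi for xi in x) / len(x))
    alpha = max(eps2, rms_x) * rho                 # adaptive step size
    beta2 = 1.0 - t ** (-0.8)                      # second-moment decay
    g = f_grad(x)
    v = [beta2 * vi + (1.0 - beta2) * (gi * gi + eps1) for vi, gi in zip(v, g)]
    u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
    rms_u = math.sqrt(sum(ui * ui for ui in u) / len(u))
    u = [ui / max(1.0, rms_u / d) for ui in u]     # update clipping
    x = [xi - alpha * ui for xi, ui in zip(x, u)]
    return x, v
```

A full optimizer would carry v and t across calls; here they are passed explicitly to keep the sketch stateless.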
==== Adafactor for Weight Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
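The factored matrix variant can be sketched the same way. Only the row statistics r and column statistics c persist between steps, never the full n-by-m second-moment matrix; the names adafactor_matrix_step and f_grad are again hypothetical stand-ins.

```python
import math

# Sketch of one factored Adafactor step for an n-by-m weight matrix x
# (list of lists). r and c are the persistent row/column statistics.
def adafactor_matrix_step(x, r, c, t, f_grad, d=1.0, eps1=1e-30, eps2=1e-3):
    n, m = len(x), len(x[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))
    rms_x = math.sqrt(sum(v * v for row in x for v in row) / (n * m))
    alpha = max(eps2, rms_x) * rho
    beta2 = 1.0 - t ** (-0.8)
    g = f_grad(x)
    g2 = [[gij * gij + eps1 for gij in row] for row in g]
    # row sums of (G^2 + eps) and column sums, decayed as above
    r = [beta2 * r[i] + (1.0 - beta2) * sum(g2[i]) for i in range(n)]
    c = [beta2 * c[j] + (1.0 - beta2) * sum(g2[i][j] for i in range(n)) for j in range(m)]
    s = sum(r)                                     # normalizer 1_n^T R_t
    # V_hat[i][j] = r[i] * c[j] / s, formed element-wise on the fly
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / s) for j in range(m)] for i in range(n)]
    rms_u = math.sqrt(sum(v * v for row in u for v in row) / (n * m))
    u = [[uij / max(1.0, rms_u / d) for uij in row] for row in u]
    x = [[x[i][j] - alpha * u[i][j] for j in range(m)] for i in range(n)]
    return x, r, c
```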
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
&lt;br /&gt;
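The two schedules above can be transcribed directly as functions of the step counter t (a direct transcription of the proposed defaults, not tuned values; the function names are illustrative):

```python
# Relative step size: capped at 1e-2, decaying as 1/sqrt(t) afterwards.
def relative_step_size(t):
    return min(1e-2, 1.0 / (t ** 0.5))

# Second-moment decay: 0 at t=1 (fast adaptation), approaching 1 later.
def second_moment_decay(t):
    return 1.0 - t ** (-0.8)

# Early steps adapt quickly, later steps are stable:
# second_moment_decay(1) = 0.0
# second_moment_decay(10000) is approximately 0.99937
```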
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of unusually large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored between steps, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
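The storage claim is easy to make concrete with rough bookkeeping for a single n-by-m weight matrix; the 4096-by-4096 shape below is a hypothetical example, not a figure from the source.

```python
# Persistent second-moment state, counted in floats, for one n-by-m
# weight matrix: Adam keeps the full matrix V, Adafactor keeps only
# the row vector R and column vector C.
def adam_second_moment_floats(n, m):
    return n * m

def adafactor_second_moment_floats(n, m):
    return n + m

n, m = 4096, 4096
print(adam_second_moment_floats(n, m))       # 16777216 floats (64 MiB at 4 bytes each)
print(adafactor_second_moment_floats(n, m))  # 8192 floats (32 KiB at 4 bytes each)
```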
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
=== Problem setup ===&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hyperparameters setup ===&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant 1)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant 2)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
=== Step 1:  Learning Rate Scaling ===&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2, RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = max(0.001,0.806)\cdot 0.01=0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient) ===&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Find the moment estimate ===&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.809 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.809} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.643&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Weight Update (&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result for the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.643&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
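The whole first iteration above can be re-run in plain Python as a check. The sketch follows the example's convention of row/column means (rather than the sums of the full algorithm); variable names are illustrative, and intermediate values match the worked numbers up to rounding.

```python
import math

# Inputs from the worked example.
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -0.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
eps2, d = 1e-3, 1.0
rho1 = min(1e-2, 1.0)                                          # t = 1

rms_x = math.sqrt(sum(v * v for row in X0 for v in row) / 9)   # approx 0.806
alpha1 = max(eps2, rms_x) * rho1                               # approx 0.00806

R1 = [sum(g * g for g in row) / 3 for row in G1]               # row means
C1 = [sum(G1[i][j] ** 2 for i in range(3)) / 3 for j in range(3)]  # column means
U1 = [[G1[i][j] / math.sqrt(R1[i] * C1[j]) for j in range(3)] for i in range(3)]
rms_u = math.sqrt(sum(v * v for row in U1 for v in row) / 9)   # approx 2.81
Uhat = [[v / max(1.0, rms_u / d) for v in row] for row in U1]  # clipping
X1 = [[X0[i][j] - alpha1 * Uhat[i][j] for j in range(3)] for i in range(3)]
print(round(X1[0][0], 3))  # 0.692
```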
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; Provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7456</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7456"/>
		<updated>2024-12-15T03:06:59Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike the traditional Adam optimizer, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that maintains gradient statistics only for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without a manually set global learning rate or heavy hyperparameter tuning. By default it performs no bias correction, yet it remains stable in large-batch training scenarios.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square of the unscaled update &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
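&lt;br /&gt;
The vector variant above can be sketched in NumPy as follows. This is an illustrative sketch, not a library API: the function name is invented here, and the defaults are the hyperparameters proposed later on this page.&lt;br /&gt;

```python
import numpy as np

def adafactor_vector_step(x, grad, v, t, d=1.0, eps1=1e-30, eps2=1e-3):
    # Relative step size and second-moment decay from the proposed schedules.
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    # Adaptive step size scaled by the RMS of the current parameters.
    alpha_t = max(eps2, np.sqrt(np.mean(x ** 2))) * rho_t
    # Second-moment estimate, normalized gradient, and clipping.
    v = beta2_t * v + (1.0 - beta2_t) * (grad ** 2 + eps1)
    u = grad / np.sqrt(v)
    u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)
    return x - alpha_t * u_hat, v

# First iteration from a fresh second-moment state (illustrative numbers).
x1, v1 = adafactor_vector_step(
    np.array([0.7, -0.5, 0.9]), np.array([0.3, -0.2, 0.4]), np.zeros(3), t=1)
```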
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
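&lt;br /&gt;
The factored matrix variant can be sketched the same way (again illustrative names, not a library API). Note that only the row and column statistics persist between steps:&lt;br /&gt;

```python
import numpy as np

def adafactor_matrix_step(x, grad, r, c, t, d=1.0, eps1=1e-30, eps2=1e-3):
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2_t = 1.0 - t ** (-0.8)
    alpha_t = max(eps2, np.sqrt(np.mean(x ** 2))) * rho_t
    g2 = grad ** 2 + eps1
    # Row and column statistics of the squared gradient: O(n + m) storage.
    r = beta2_t * r + (1.0 - beta2_t) * g2.sum(axis=1)
    c = beta2_t * c + (1.0 - beta2_t) * g2.sum(axis=0)
    # Factored second-moment estimate: outer product normalized by sum(r).
    v_hat = np.outer(r, c) / r.sum()
    u = grad / np.sqrt(v_hat)
    u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)
    return x - alpha_t * u_hat, r, c

# One step on a small illustrative matrix.
x0 = np.array([[0.1, -0.2], [0.3, 0.4]])
g0 = np.array([[0.5, -0.5], [0.25, 0.25]])
x1, r1, c1 = adafactor_matrix_step(x0, g0, np.zeros(2), np.zeros(2), t=1)
```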
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
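&lt;br /&gt;
The two schedules can be checked numerically; a small illustrative sketch shows how the step size caps early and decays late, while the decay factor approaches 1:&lt;br /&gt;

```python
def rho(t):
    """Relative step size: decays as 1/sqrt(t), capped at 1e-2."""
    return min(1e-2, t ** (-0.5))

def beta2(t):
    """Second-moment decay: starts at 0 and approaches 1."""
    return 1.0 - t ** (-0.8)

# Tabulate a few training steps to see the trend.
values = [(t, rho(t), round(beta2(t), 4)) for t in (1, 100, 10 ** 6)]
```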
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
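&lt;br /&gt;
The clipping rule can be written directly from the formula above (a minimal sketch): updates with RMS below the threshold pass through unchanged, while larger updates are rescaled so their RMS equals &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;.&lt;br /&gt;

```python
import numpy as np

def clip_update(u, d=1.0):
    # Rescale u only when its RMS exceeds the threshold d.
    rms = np.sqrt(np.mean(np.square(u)))
    return u / max(1.0, rms / d)

small = clip_update(np.array([0.1, -0.2]))  # RMS below d: unchanged
large = clip_update(np.array([3.0, -4.0]))  # RMS above d: rescaled to RMS = d
```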
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;, normalized by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** the operation can be applied element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored between iterations, rather than the full second-moment matrix&lt;br /&gt;
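&lt;br /&gt;
The savings are easy to quantify. For an illustrative square weight matrix of size 4096:&lt;br /&gt;

```python
n, m = 4096, 4096
full_moment_entries = n * m   # what a full Adam-style second moment stores
factored_entries = n + m      # what Adafactor's R and C store
ratio = full_moment_entries / factored_entries  # 2048x fewer statistics
```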
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
=== Problem setup ===&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The gradient of the loss function with respect to &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hyperparameters setup ===&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant for the step-size scaling)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
=== Step 1:  Learning Rate Scaling ===&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
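&lt;br /&gt;
This value can be verified numerically, using the entries as they appear in the sum above:&lt;br /&gt;

```python
import numpy as np

x0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -0.6],
               [1.2, -0.7, 0.4]])
rms_x0 = np.sqrt(np.mean(x0 ** 2))  # sqrt(5.85 / 9), approximately 0.806
```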
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2, RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = max(0.001,0.806)\cdot 0.01=0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient) ===&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Find the moment estimate ===&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t[i] = \hat{\beta}_{2t} \cdot R_{t-1}[i] + (1-\hat{\beta}_{2t})\cdot \left(\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1\right) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, and because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t[j] = \hat{\beta}_{2t}\cdot C_{t-1}[j] + (1-\hat{\beta}_{2t})\cdot \left(\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1\right) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). The full algorithm also normalizes this product by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; since the clipping step later rescales the update by its RMS, this scalar factor does not change the final result of this iteration, so it is omitted here.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
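&lt;br /&gt;
The row means, column means, and their outer product can be reproduced as follows:&lt;br /&gt;

```python
import numpy as np

g1_sq = np.array([[0.09, 0.04, 0.16],
                  [0.25, 0.36, 0.01],
                  [0.04, 0.16, 0.09]])
r1 = g1_sq.mean(axis=1)   # row-wise means
c1 = g1_sq.mean(axis=0)   # column-wise means
v1_hat = np.outer(r1, c1)  # second moment estimate used above
```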
&lt;br /&gt;
=== Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
The update vector is computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; element-wise by the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Weight Update (&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
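&lt;br /&gt;
The whole first iteration can be reproduced end-to-end. This sketch follows the arithmetic of the worked sums above; the resulting values agree with the matrices shown, up to rounding.&lt;br /&gt;

```python
import numpy as np

x0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -0.6], [1.2, -0.7, 0.4]])
g1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])

alpha1 = max(1e-3, np.sqrt(np.mean(x0 ** 2))) * 1e-2       # Step 1
g1_sq = g1 ** 2                                            # Step 2
v1_hat = np.outer(g1_sq.mean(axis=1), g1_sq.mean(axis=0))  # Step 3
u1 = g1 / np.sqrt(v1_hat)                                  # Step 4.1
u1_hat = u1 / max(1.0, np.sqrt(np.mean(u1 ** 2)))          # Step 4.2
x1 = x0 - alpha1 * u1_hat                                  # Step 5
```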
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved 99% accuracy and F1 score. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, particularly in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of Machine Learning Research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7453</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7453"/>
		<updated>2024-12-15T03:04:30Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
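For concreteness, the vector-case loop above can be sketched in plain Python. This is an illustrative sketch only, not a reference implementation: the gradient function <code>grad</code> is a user-supplied placeholder, and parameters are plain lists of floats.

```python
import math

def rms(v):
    """Root mean square of a vector of floats."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def adafactor_vector(x, grad, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run T Adafactor steps on a list-of-floats parameter x.

    grad(x) must return the gradient as a list of the same length.
    A minimal sketch of the vector algorithm; no factorization is
    needed here since the second moment V is already O(n)."""
    V = [0.0] * len(x)
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))        # relative step size
        alpha = max(eps2, rms(x)) * rho            # adaptive step size
        beta2 = 1.0 - t ** (-0.8)                  # decay, equals 0 at t = 1
        g = grad(x)
        # Exponential moving average of squared gradients (plus eps1).
        V = [beta2 * v + (1 - beta2) * (gi * gi + eps1)
             for v, gi in zip(V, g)]
        U = [gi / math.sqrt(vi) for gi, vi in zip(g, V)]
        clip = max(1.0, rms(U) / d)                # update clipping
        x = [xi - alpha * ui / clip for xi, ui in zip(x, U)]
    return x
```

Because at <math>t = 1</math> the decay is 0, the first step uses only the current squared gradient, which is why no separate bias correction appears.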
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
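The matrix algorithm above can likewise be sketched in Python. This is a minimal sketch under the stated update rules, not a production optimizer: only the row accumulator <math>R</math> (length <math>n</math>) and column accumulator <math>C</math> (length <math>m</math>) are stored, and the factored second moment is used element-wise without ever materializing the full matrix.

```python
import math

def rms(M):
    """Root mean square over all entries of a matrix (list of row lists)."""
    flat = [v for row in M for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))

def adafactor_matrix(X, grad, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Run T Adafactor steps on an n x m parameter X (list of row lists).

    grad(X) must return the gradient with the same shape. Keeps only the
    row/column second-moment accumulators R (n values) and C (m values)."""
    n, m = len(X), len(X[0])
    R, C = [0.0] * n, [0.0] * m
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))
        alpha = max(eps2, rms(X)) * rho
        b2 = 1.0 - t ** (-0.8)                     # beta_hat_2t, 0 at t = 1
        G = grad(X)
        G2 = [[g * g + eps1 for g in row] for row in G]
        # Row sums and column sums of (G^2 + eps1): the factored statistics.
        R = [b2 * R[i] + (1 - b2) * sum(G2[i]) for i in range(n)]
        C = [b2 * C[j] + (1 - b2) * sum(G2[i][j] for i in range(n))
             for j in range(m)]
        tot = sum(R)
        # V_hat[i][j] = R[i] * C[j] / tot is evaluated element-wise here;
        # the full n x m second-moment matrix is never stored.
        U = [[G[i][j] / math.sqrt(R[i] * C[j] / tot) for j in range(m)]
             for i in range(n)]
        clip = max(1.0, rms(U) / d)
        X = [[X[i][j] - alpha * U[i][j] / clip for j in range(m)]
             for i in range(n)]
    return X
```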
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor starts at 0 for the first step (so the first update relies only on the current gradient, removing the need for separate bias correction) and approaches 1 as training progresses.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances rapid adaptation in early training against stability in later stages, ensuring smoother convergence.&lt;br /&gt;
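As a quick numeric check, the two schedules follow directly from the formulas above (values shown in the comments are computed, not prescribed by the paper):

```python
def rho(t):
    """Relative step size schedule: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, t ** -0.5)

def beta2_hat(t):
    """Second moment decay schedule: 1 - t^(-0.8)."""
    return 1.0 - t ** -0.8

# rho stays capped at 1e-2 until t = 10^4, then decays as 1/sqrt(t);
# beta2_hat starts at 0 (no history at t = 1) and approaches 1.
```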
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** the operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix, and&lt;br /&gt;
** only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix.&lt;br /&gt;
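To make the savings concrete, a quick count of accumulator entries (illustrative only; the 4096-wide layer in the comment is a hypothetical example, not a figure from the paper):

```python
def full_second_moment_entries(n, m):
    """Adam-style: one second-moment accumulator entry per parameter."""
    return n * m

def factored_second_moment_entries(n, m):
    """Adafactor: one row vector plus one column vector."""
    return n + m

# For a hypothetical 4096 x 4096 weight matrix, the factored statistics
# need 8,192 entries instead of 16,777,216 -- a 2048x reduction.
```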
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
A step-by-step walkthrough of the first iteration.&lt;br /&gt;
&lt;br /&gt;
=== Problem setup ===&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hyperparameters setup ===&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant 1)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant 2)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
=== Step 1:  Learning Rate Scaling ===&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = max(0.001,0.806)\cdot 0.01=0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient) ===&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Find the moment estimate ===&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, and since &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{m}\sum_{j=1}^m G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Weight Update (&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;) ===&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
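The first iteration can be reproduced with a short script (a sketch of Steps 1–5 above; the <math>X_0</math> entries follow the values actually substituted in the RMS computation, including <math>-0.6</math> in position (2,3)):

```python
import math

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -0.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]

def rms(M):
    """RMS over all entries of a 2-D list."""
    flat = [v for row in M for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))

alpha = max(1e-3, rms(X0)) * 1e-2                       # Step 1: ~0.00806
R = [sum(g * g for g in row) / 3 for row in G1]          # Step 3.1: row means
C = [sum(G1[i][j] ** 2 for i in range(3)) / 3
     for j in range(3)]                                  # Step 3.2: column means
U = [[G1[i][j] / math.sqrt(R[i] * C[j]) for j in range(3)]
     for i in range(3)]                                  # Step 4.1: V = R (x) C
clip = max(1.0, rms(U) / 1.0)                            # Step 4.2: ~2.808
X1 = [[X0[i][j] - alpha * U[i][j] / clip for j in range(3)]
      for i in range(3)]                                 # Step 5
```

The computed entries agree with the matrices above to rounding.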
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training time was faster than with Adam, completing in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 scores for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== Reference ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7442</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7442"/>
		<updated>2024-12-15T02:43:25Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
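The vector algorithm above can be sketched in a few lines of plain Python (a minimal illustration of the stated update rules, not a production implementation; the function and variable names are chosen for this sketch):

```python
import math

def adafactor_vector_step(x, grad, v, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """One Adafactor update for a weight vector x (plain lists of floats)."""
    n = len(x)
    rho = min(1e-2, 1.0 / math.sqrt(t))                 # relative step size
    rms_x = math.sqrt(sum(xi * xi for xi in x) / n)     # RMS of the parameters
    alpha = max(eps2, rms_x) * rho                      # adaptive step size
    beta2 = 1.0 - t ** -0.8                             # second-moment decay (0 at t = 1)
    # running average of the squared gradient
    v = [beta2 * vi + (1 - beta2) * (g * g + eps1) for vi, g in zip(v, grad)]
    u = [g / math.sqrt(vi) for g, vi in zip(grad, v)]   # normalized gradient
    rms_u = math.sqrt(sum(ui * ui for ui in u) / n)
    clip = max(1.0, rms_u / d)                          # cap the update's RMS at d
    x = [xi - alpha * ui / clip for xi, ui in zip(x, u)]
    return x, v
```

At t = 1 the decay term is zero, so v reduces to the squared gradient and u has unit magnitude per coordinate.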
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
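Likewise, the factored matrix variant can be sketched in plain Python on nested lists (a minimal illustration under the rules above; names are ours, not from the paper). Note that only the row statistics R (length n) and column statistics C (length m) are kept, and each entry of the second moment is reconstructed on the fly:

```python
import math

def adafactor_matrix_step(X, G, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """One factored Adafactor update for an n x m weight matrix X."""
    n, m = len(X), len(X[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))
    rms_x = math.sqrt(sum(x * x for row in X for x in row) / (n * m))
    alpha = max(eps2, rms_x) * rho
    beta2 = 1.0 - t ** -0.8
    G2 = [[g * g + eps1 for g in row] for row in G]
    # row-wise and column-wise second-moment statistics (sums over G2)
    R = [beta2 * R[i] + (1 - beta2) * sum(G2[i]) for i in range(n)]
    C = [beta2 * C[j] + (1 - beta2) * sum(G2[i][j] for i in range(n)) for j in range(m)]
    sR = sum(R)
    # V_hat[i][j] = R[i] * C[j] / sum(R), computed element-wise, never stored
    U = [[G[i][j] / math.sqrt(R[i] * C[j] / sR) for j in range(m)] for i in range(n)]
    rms_u = math.sqrt(sum(u * u for row in U for u in row) / (n * m))
    clip = max(1.0, rms_u / d)
    X = [[X[i][j] - alpha * U[i][j] / clip for j in range(m)] for i in range(n)]
    return X, R, C
```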
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
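The two schedules above can be tabulated with a couple of lines of plain Python (the function names are ours, chosen for this sketch):

```python
import math

def rho(t):
    """Relative step size: capped at 1e-2, decaying as 1/sqrt(t)."""
    return min(1e-2, 1.0 / math.sqrt(t))

def beta2hat(t):
    """Second-moment decay: 0 at t = 1, approaching 1 as t grows."""
    return 1.0 - t ** -0.8

# The 1e-2 cap dominates rho until t > 10^4, after which 1/sqrt(t) takes over.
# The decay rises quickly, so early steps adapt fast and later steps average
# the squared gradients over a long history.
schedule = [(t, rho(t), beta2hat(t)) for t in (1, 10, 100, 10**4, 10**6)]
```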
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only the row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
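The memory argument can be made concrete with a few lines of Python (the layer sizes below are hypothetical, chosen only for illustration):

```python
# Hypothetical n x m weight matrix of a large model layer.
n, m = 4096, 4096

adam_entries = n * m        # Adam keeps one second-moment entry per weight
adafactor_entries = n + m   # Adafactor keeps one entry per row plus one per column

def vhat_entry(R, C, i, j):
    """Reconstruct a single entry of the factored second moment on demand,
    so the full n x m matrix is never materialized."""
    return R[i] * C[j] / sum(R)
```

For this layer the factored statistics are smaller by a factor of n*m / (n+m) = 2048.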
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Second-moment regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Parameter-scale regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; becomes:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is approximated here as the outer product of the row means (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column means (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​). (The full algorithm divides this outer product by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; since that only rescales &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; by a constant factor, the clipped update in Step 4.2 comes out the same in this example.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
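The first iteration can be checked end-to-end with a short script (plain Python; it follows this example's mean-based simplification, with the decay term zero at t = 1 and the tiny ε₁ ignored):

```python
import math

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
d, eps2, rho1 = 1.0, 1e-3, 1e-2   # first-iteration hyperparameters

# Step 1: adaptive step size from the RMS of the initial weights
flat = [x for row in X0 for x in row]
rms_x = math.sqrt(sum(x * x for x in flat) / len(flat))
alpha1 = max(eps2, rms_x) * rho1

# Steps 2-3: with beta_hat_21 = 0 the moments reduce to row/column means of G1**2
R1 = [sum(g * g for g in row) / 3 for row in G1]
C1 = [sum(G1[i][j] ** 2 for i in range(3)) / 3 for j in range(3)]
V1 = [[R1[i] * C1[j] for j in range(3)] for i in range(3)]

# Step 4: normalize, then clip the update's RMS at d
U1 = [[G1[i][j] / math.sqrt(V1[i][j]) for j in range(3)] for i in range(3)]
rms_u = math.sqrt(sum(u * u for row in U1 for u in row) / 9)
Uhat = [[u / max(1.0, rms_u / d) for u in row] for row in U1]

# Step 5: weight update
X1 = [[X0[i][j] - alpha1 * Uhat[i][j] for j in range(3)] for i in range(3)]
```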
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor trained models that achieved 99% in both accuracy and F1 score. Training was also faster than with Adam, completing in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers showed that Adafactor excelled in both time efficiency and performance, particularly in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7287</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7287"/>
		<updated>2024-12-13T21:54:00Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Software Tools and Platforms */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
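The vector algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the toy quadratic loss and its gradient at the bottom are our own choices for demonstration, not part of the algorithm.&lt;br /&gt;
&lt;br /&gt;
```python
import math

def rms(values):
    # Root mean square of a list of numbers.
    return math.sqrt(sum(v * v for v in values) / len(values))

def adafactor_vector(x, grad_fn, steps, eps1=1e-30, eps2=1e-3, d=1.0):
    # v holds the running second-moment estimate, one entry per parameter.
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))      # relative step size
        alpha = max(eps2, rms(x)) * rho          # adaptive step size
        beta2 = 1.0 - t ** -0.8                  # second-moment decay (0 at t = 1)
        g = grad_fn(x)
        v = [beta2 * vi + (1 - beta2) * (gi * gi + eps1) for vi, gi in zip(v, g)]
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
        clip = max(1.0, rms(u) / d)              # RMS clipping
        x = [xi - alpha * ui / clip for xi, ui in zip(x, u)]
    return x

# Toy example: minimize f(x) = 0.5 * sum(x_i^2), whose gradient is x itself.
x = adafactor_vector([1.0, -2.0, 3.0], lambda x: list(x), steps=100)
```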
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
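The factored matrix variant can likewise be sketched in plain Python (illustrative only; the helper name adafactor_matrix_step is ours). Only the row statistics and column statistics persist across steps; the full second-moment estimate is formed element-wise on the fly.&lt;br /&gt;
&lt;br /&gt;
```python
import math

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    # One Adafactor step for an n-by-m weight matrix X with gradient G.
    # R (length n) and C (length m) are the factored second-moment statistics;
    # the full n-by-m estimate V is never stored between iterations.
    n, m = len(X), len(X[0])
    flat = [x for row in X for x in row]
    rms_x = math.sqrt(sum(v * v for v in flat) / (n * m))
    alpha = max(eps2, rms_x) * min(1e-2, 1.0 / math.sqrt(t))
    beta2 = 1.0 - t ** -0.8
    # Row and column sums of (G^2 + eps1), as in the update rules above.
    G2 = [[g * g + eps1 for g in row] for row in G]
    R = [beta2 * R[i] + (1 - beta2) * sum(G2[i]) for i in range(n)]
    C = [beta2 * C[j] + (1 - beta2) * sum(G2[i][j] for i in range(n)) for j in range(m)]
    denom = sum(R)
    # V[i][j] = R[i] * C[j] / sum(R), formed element-wise, never materialized.
    U = [[G[i][j] / math.sqrt(R[i] * C[j] / denom) for j in range(m)] for i in range(n)]
    flat_u = [u for row in U for u in row]
    rms_u = math.sqrt(sum(u * u for u in flat_u) / (n * m))
    clip = max(1.0, rms_u / d)
    X = [[X[i][j] - alpha * U[i][j] / clip for j in range(m)] for i in range(n)]
    return X, R, C
```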
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
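These proposed defaults translate directly into simple schedule functions (a sketch; the constant and function names are ours):&lt;br /&gt;
&lt;br /&gt;
```python
def rho(t):
    # Relative step size: capped at 1e-2, decaying as 1/sqrt(t).
    return min(1e-2, 1.0 / t ** 0.5)

def beta2_hat(t):
    # Second-moment decay: equals 0 at t = 1, approaching 1 as t grows.
    return 1.0 - t ** -0.8

EPS1 = 1e-30   # regularization inside the second-moment estimate
EPS2 = 1e-3    # lower bound on the RMS(x) scale factor in the step size
D = 1.0        # clipping threshold
```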
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for very large models and often runs with relative, unscaled learning rates.&lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, clipping helps ensure stable and efficient training without requiring the full per-parameter second-moment storage used by Adam.&lt;br /&gt;
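The clipping rule can be seen in a few lines (a sketch; the function name is ours). A large update is rescaled so its RMS equals the threshold, while a small update passes through unchanged:&lt;br /&gt;
&lt;br /&gt;
```python
import math

def clipped_update(u, d=1.0):
    # Scale the update so its RMS never exceeds the threshold d.
    rms = math.sqrt(sum(x * x for x in u) / len(u))
    scale = max(1.0, rms / d)
    return [x / scale for x in u]

big = clipped_update([3.0, -4.0])     # RMS was about 3.54, rescaled to 1.0
small = clipped_update([0.1, -0.2])   # RMS about 0.16 is below d, left as-is
```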
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; statistics, Adafactor maintains only their row-wise and column-wise aggregates, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored between iterations, instead of the full second-moment matrix&lt;br /&gt;
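Counting stored entries makes the saving concrete (the sizes below are illustrative only):&lt;br /&gt;
&lt;br /&gt;
```python
n, m = 4096, 4096          # a typical large weight matrix
full = n * m               # Adam-style full second-moment matrix: ~16.8M entries
factored = n + m           # Adafactor: one row vector plus one column vector
print(full // factored)    # prints 2048: ~2048x fewer stored statistics
```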
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant for the step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2, RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001, 0.946)\cdot 0.01 = 0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, and because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated from the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). The full algorithm also divides this product by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; this example omits that normalization, which only rescales &amp;lt;math&amp;gt;U_t&amp;lt;/math&amp;gt; uniformly and therefore cancels in the clipped update of Step 4.2, where &amp;lt;math&amp;gt;U_t&amp;lt;/math&amp;gt; is divided by its own RMS (here &amp;lt;math&amp;gt;RMS(U_t) &amp;gt; d&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9} \sum_{i=1}^9 U_1[i]^2} \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U_1}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
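The five steps can be reproduced with a short script that follows this example's conventions (row/column means and the unnormalized outer product; the variable names are ours):&lt;br /&gt;
&lt;br /&gt;
```python
import math

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]

def rms(flat):
    return math.sqrt(sum(v * v for v in flat) / len(flat))

# Step 1: learning-rate scaling.
rho1 = min(1e-2, 1.0)                        # relative step size at t = 1
alpha1 = max(1e-3, rms([v for r in X0 for v in r])) * rho1

# Steps 2-3: squared gradient, row/column means, outer product.
G2 = [[g * g for g in row] for row in G1]
R1 = [sum(row) / 3 for row in G2]            # row-wise means
C1 = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]  # column-wise means
V1 = [[R1[i] * C1[j] for j in range(3)] for i in range(3)]

# Step 4: normalized, then RMS-clipped, update.
U1 = [[G1[i][j] / math.sqrt(V1[i][j]) for j in range(3)] for i in range(3)]
clip = max(1.0, rms([u for r in U1 for u in r]) / 1.0)
Uhat = [[u / clip for u in row] for row in U1]

# Step 5: weight update.
X1 = [[X0[i][j] - alpha1 * Uhat[i][j] for j in range(3)] for i in range(3)]
```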
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor trained models that achieved 99% accuracy and F1 score. Moreover, training completed faster than with Adam, in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ====&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ====&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7286</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7286"/>
		<updated>2024-12-13T21:51:26Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
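The loop above can be sketched in plain Python (a minimal, illustrative implementation of the vector variant; the function and variable names are my own, not from any library):&lt;br /&gt;

```python
import math

def adafactor_vector_step(x, grad, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    # One Adafactor update for a weight vector x (list of floats).
    # v is the running second-moment estimate (all zeros at t = 1).
    n = len(x)
    rho_t = min(1e-2, 1.0 / math.sqrt(t))       # relative step size
    beta2_t = 1.0 - t ** -0.8                   # second-moment decay
    rms_x = math.sqrt(sum(xi * xi for xi in x) / n)
    alpha_t = max(eps2, rms_x) * rho_t          # adaptive step size
    # Second-moment estimate and normalized gradient.
    v = [beta2_t * vi + (1.0 - beta2_t) * (gi * gi + eps1)
         for vi, gi in zip(v, grad)]
    u = [gi / math.sqrt(vi) for gi, vi in zip(grad, v)]
    # Clip so that RMS(u) never exceeds the threshold d.
    rms_u = math.sqrt(sum(ui * ui for ui in u) / n)
    scale = max(1.0, rms_u / d)
    x = [xi - alpha_t * ui / scale for xi, ui in zip(x, u)]
    return x, v
```

Calling the function repeatedly over t = 1, 2, ... reproduces the for-loop in the pseudocode above.&lt;br /&gt;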
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
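A sketch of the factored matrix variant (plain Python, names my own), written so that only the row and column statistics persist between steps:&lt;br /&gt;

```python
import math

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    # One factored Adafactor update for an n x m weight matrix X.
    # R (length n) and C (length m) are the row/column second-moment
    # accumulators from the previous step (zeros at t = 1).
    n, m = len(X), len(X[0])
    rho_t = min(1e-2, 1.0 / math.sqrt(t))
    beta2_t = 1.0 - t ** -0.8
    rms_X = math.sqrt(sum(x * x for row in X for x in row) / (n * m))
    alpha_t = max(eps2, rms_X) * rho_t
    G2 = [[g * g + eps1 for g in row] for row in G]
    # Row sums and column sums of G^2: only O(n + m) statistics are kept.
    R = [beta2_t * R[i] + (1.0 - beta2_t) * sum(G2[i]) for i in range(n)]
    C = [beta2_t * C[j] + (1.0 - beta2_t) * sum(G2[i][j] for i in range(n))
         for j in range(m)]
    denom = sum(R)
    # V_hat[i][j] = R[i] * C[j] / sum(R), formed element-wise on the fly.
    U = [[G[i][j] / math.sqrt(R[i] * C[j] / denom) for j in range(m)]
         for i in range(n)]
    rms_U = math.sqrt(sum(u * u for row in U for u in row) / (n * m))
    scale = max(1.0, rms_U / d)
    X = [[X[i][j] - alpha_t * U[i][j] / scale for j in range(m)]
         for i in range(n)]
    return X, R, C
```

Note that the full second-moment matrix is reconstructed element by element inside the update and never stored.&lt;br /&gt;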
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
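A quick way to see how the two proposed schedules behave over training (illustrative values only):&lt;br /&gt;

```python
def rho(t):
    # Relative step size: capped at 1e-2, decaying as 1/sqrt(t).
    return min(1e-2, 1.0 / t ** 0.5)

def beta2(t):
    # Second-moment decay: starts at 0 and approaches 1.
    return 1.0 - t ** -0.8

for t in (1, 10, 100, 1000000):
    print(t, rho(t), beta2(t))
# rho stays at the 1e-2 cap until t exceeds 10**4, then decays;
# beta2 rises from 0 toward 1, weighting history more as training proceeds.
```
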
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; in memory since&lt;br /&gt;
** The operation is performed element-wise, so it never materializes &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; as a stored &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** It only stores &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; instead of the full second-moment matrix&lt;br /&gt;
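To make the savings concrete, a small back-of-the-envelope count for a hypothetical layer size:&lt;br /&gt;

```python
# Entries (not bytes) needed for the second-moment statistics of a
# single hypothetical 4096 x 4096 weight matrix.
n, m = 4096, 4096
adam_entries = n * m         # Adam stores the full matrix: O(n*m)
adafactor_entries = n + m    # Adafactor stores R and C: O(n+m)
print(adam_entries, adafactor_entries, adam_entries // adafactor_entries)
```

For this layer the factored statistics are 2048 times smaller, and the gap grows with layer size.&lt;br /&gt;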
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant 1)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant 2)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, but averaging over rows.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​). With the row and column means used here, this differs from the factored estimate &amp;lt;math&amp;gt;R_t C_t / (1_n^T R_t)&amp;lt;/math&amp;gt; of the algorithm only by a constant factor, which the clipping in Step 4 absorbs, so the final update is unaffected.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_t[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U_1}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result for the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
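The whole first iteration can be replayed in a few lines of plain Python (a sketch following the mean-based steps of this example, not a library implementation):&lt;br /&gt;

```python
import math

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]

def rms(M):
    # Root mean square over all 9 entries of a 3 x 3 matrix.
    return math.sqrt(sum(x * x for row in M for x in row) / 9)

alpha1 = max(1e-3, rms(X0)) * min(1e-2, 1.0)   # Step 1: learning rate
G2 = [[g * g for g in row] for row in G1]      # Step 2: squared gradient
R1 = [sum(row) / 3 for row in G2]              # Step 3.1: row means
C1 = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]  # Step 3.2
V1 = [[R1[i] * C1[j] for j in range(3)] for i in range(3)]    # Step 3.3
U1 = [[G1[i][j] / math.sqrt(V1[i][j]) for j in range(3)]
      for i in range(3)]                       # Step 4.1
scale = max(1.0, rms(U1) / 1.0)                # Step 4.2: clipping, d = 1
X1 = [[X0[i][j] - alpha1 * U1[i][j] / scale for j in range(3)]
      for i in range(3)]                       # Step 5: weight update
```

Printing the intermediates lets each step of the derivation be checked numerically.&lt;br /&gt;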
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved 99% accuracy and F1 score. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Adafactor addresses the memory consumption challenge of training large-scale deep learning models. By factorizing the second-order moment matrix and dynamically adjusting the learning rate, Adafactor minimizes resource usage without compromising performance. Adafactor can be applied to the training tasks of large language models such as Transformers, T5 models, and Vision Transformers.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
# Shazeer, Noam, and Mitchell Stern. &amp;quot;Adafactor: Adaptive learning rates with sublinear memory cost.&amp;quot; &#039;&#039;International Conference on Machine Learning&#039;&#039;. PMLR, 2018.&lt;br /&gt;
# Raffel, Colin, et al. &amp;quot;Exploring the limits of transfer learning with a unified text-to-text transformer.&amp;quot; &#039;&#039;Journal of machine learning research&#039;&#039; 21.140 (2020): 1-67.&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# Chauhan, Tavishee, and Hemant Palivela. &amp;quot;The Fine tuning of Language models for automation of Humor Detection.&amp;quot; &#039;&#039;INFOCOMP Journal of Computer Science&#039;&#039; 20.2 (2021).&lt;br /&gt;
# Lepikhin, Dmitry, et al. &amp;quot;Gshard: Scaling giant models with conditional computation and automatic sharding.&amp;quot; &#039;&#039;arXiv preprint arXiv:2006.16668&#039;&#039; (2020).&lt;br /&gt;
# &amp;quot;Improving Adaptive Moment Optimization via Preconditioner Diagonalization.&amp;quot;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://pytorch.org/docs/stable/generated/torch.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
# &amp;lt;nowiki&amp;gt;https://flax.readthedocs.io/en/v0.5.3/_autosummary/flax.optim.Adafactor.html&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7282</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7282"/>
		<updated>2024-12-13T21:48:11Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Applications */ change tensorflow&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
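As an illustration, the vector algorithm above can be sketched in plain Python (a minimal, unoptimized sketch; the quadratic test problem and all function names are invented for the demo):

```python
import math

def rms(v):
    # Root mean square of a list of numbers.
    return math.sqrt(sum(x * x for x in v) / len(v))

def adafactor_vector(x, grad_fn, steps, eps1=1e-30, eps2=1e-3, d=1.0):
    # One-dimensional Adafactor: keeps a full running average v of the
    # squared gradient (the factorization only pays off for matrices).
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))     # relative step size
        alpha = max(eps2, rms(x)) * rho         # adaptive step size
        beta2 = 1.0 - t ** (-0.8)               # decay; equals 0 at t = 1
        g = grad_fn(x)
        v = [beta2 * vi + (1.0 - beta2) * (gi * gi + eps1)
             for vi, gi in zip(v, g)]
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
        clip = max(1.0, rms(u) / d)             # RMS-based update clipping
        x = [xi - alpha * ui / clip for xi, ui in zip(x, u)]
    return x

# Minimize f(x) = sum of squares; its gradient is 2 * x.
x_final = adafactor_vector([1.0, -2.0, 3.0],
                           lambda z: [2.0 * zi for zi in z], 1000)
```

On this toy quadratic, all coordinates are driven close to zero without any hand-set global learning rate, which is the point of the relative step size.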
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
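The factored update for a weight matrix can be sketched in the same hedged spirit (pure Python, illustrative only; only the row statistics and column statistics are ever stored):

```python
import math

def rms(mat):
    # Root mean square over all entries of a matrix (list of rows).
    flat = [x for row in mat for x in row]
    return math.sqrt(sum(x * x for x in flat) / len(flat))

def adafactor_matrix_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    # One Adafactor step for an n-by-m matrix: only r (length n) and
    # c (length m) persist between steps, never a full n-by-m moment matrix.
    n, m = len(x), len(x[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))
    alpha = max(eps2, rms(x)) * rho
    beta2 = 1.0 - t ** (-0.8)
    # Row and column sums of (G^2 + eps1), folded into running averages.
    r = [beta2 * r[i] + (1.0 - beta2) * sum(g[i][j] ** 2 + eps1 for j in range(m))
         for i in range(n)]
    c = [beta2 * c[j] + (1.0 - beta2) * sum(g[i][j] ** 2 + eps1 for i in range(n))
         for j in range(m)]
    denom = sum(r)                      # the 1_n^T R_t normalizer
    # V-hat is reconstructed element-wise as r_i * c_j / denom, so the
    # full matrix is never materialized as a stored object.
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / denom) for j in range(m)]
         for i in range(n)]
    clip = max(1.0, rms(u) / d)
    x = [[x[i][j] - alpha * u[i][j] / clip for j in range(m)] for i in range(n)]
    return x, r, c

# One step on a toy 2-by-2 matrix with gradient 2 * X (sum-of-squares loss).
X = [[1.0, -2.0], [3.0, -4.0]]
G = [[2.0 * v for v in row] for row in X]
X, R, C = adafactor_matrix_step(X, G, [0.0, 0.0], [0.0, 0.0], t=1)
```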
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
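The two proposed schedules can be tabulated directly (a small illustrative snippet):

```python
import math

def rel_step(t):
    # Relative step size: capped at 1e-2, decaying as 1/sqrt(t).
    return min(1e-2, 1.0 / math.sqrt(t))

def beta2_hat(t):
    # Second-moment decay: equals 0 at t = 1, approaching 1 as t grows.
    return 1.0 - t ** (-0.8)

schedule = [(t, rel_step(t), beta2_hat(t)) for t in (1, 10, 100, 10000, 100000)]
```

Note the cap dominates until roughly t = 10000, after which the 1/sqrt(t) term takes over.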
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
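The clipping rule amounts to a single rescale (illustrative sketch for a flat list of update entries):

```python
import math

def clip_update(u, d=1.0):
    # Rescale the update so its RMS does not exceed the threshold d;
    # updates already at or below the threshold pass through unchanged.
    r = math.sqrt(sum(x * x for x in u) / len(u))
    scale = max(1.0, r / d)
    return [x / scale for x in u]
```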
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; statistics, Adafactor maintains only the row-wise and column-wise averages, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
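The scale of the saving is easy to quantify (an illustrative count for a hypothetical 4096-by-4096 weight matrix):

```python
# Stored second-moment statistics for an n-by-m weight matrix.
n, m = 4096, 4096
full_second_moment = n * m    # Adam-style per-parameter storage
factored = n + m              # Adafactor stores only R (n) and C (m)
ratio = full_second_moment / factored
```

For this shape the factored statistics are 2048 times smaller than the full matrix.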
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This updates the second moment based on the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for numerical stability)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = max(0.001,0.806)\cdot 0.01=0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
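These two quantities can be checked in a few lines of Python (the entries below are the ones substituted in the RMS formula above, with the (2,3) entry of X_0 taken as -0.6):

```python
import math

# Entries of X_0, flattened row by row.
x0 = [0.7, -0.5, 0.9, -1.1, 0.8, -0.6, 1.2, -0.7, 0.4]
rms_x0 = math.sqrt(sum(v * v for v in x0) / len(x0))   # about 0.806
rho_1 = min(1e-2, 1.0 / math.sqrt(1))                  # 0.01 at t = 1
alpha_1 = max(1e-3, rms_x0) * rho_1                    # about 0.00806
```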
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; reduces to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is reconstructed from the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). The algorithm uses &amp;lt;math&amp;gt;\hat{V}_t = R_t C_t / (1_n^T R_t)&amp;lt;/math&amp;gt;; for simplicity, this example takes the plain outer product, which rescales &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; by a constant factor. That factor cancels in the clipping step below, so the final update is unaffected here.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.809 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1) &amp;gt; d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.809} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
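For readers who want to verify the arithmetic, the full first iteration can be reproduced in Python (an illustrative check; as in the example, V-hat is formed from the raw outer product of the row and column means, whose constant normalizer cancels when the update is clipped, and the (2,3) entry of X_0 is taken as -0.6, the value used in the RMS substitution of Step 1):

```python
import math

def rms(mat):
    # Root mean square over all entries of a matrix (list of rows).
    flat = [v for row in mat for v in row]
    return math.sqrt(sum(v * v for v in flat) / len(flat))

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -0.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
n, m = 3, 3

alpha1 = max(1e-3, rms(X0)) * min(1e-2, 1.0)   # step size, about 0.00806

G2 = [[g * g for g in row] for row in G1]                     # element-wise square
R1 = [sum(row) / m for row in G2]                             # row-wise means
C1 = [sum(G2[i][j] for i in range(n)) / n for j in range(m)]  # column-wise means
V1 = [[R1[i] * C1[j] for j in range(m)] for i in range(n)]    # outer product

U1 = [[G1[i][j] / math.sqrt(V1[i][j]) for j in range(m)] for i in range(n)]
clip = max(1.0, rms(U1) / 1.0)                 # RMS(U1) is about 2.81
X1 = [[X0[i][j] - alpha1 * U1[i][j] / clip for j in range(m)]
      for i in range(n)]
```

Small rounding differences aside, X1 matches the matrix shown above.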
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor trained models that achieved accuracy and F1 scores of 99%. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7027</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7027"/>
		<updated>2024-12-12T16:58:34Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Proposed Hyperparameters for Adafactor */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
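&lt;br /&gt;
The vector algorithm above can be sketched in NumPy as follows (an illustrative translation of the loop, not the authors&#039; reference implementation; the names &lt;code&gt;adafactor_vector_step&lt;/code&gt; and &lt;code&gt;grad_fn&lt;/code&gt; are ours):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def adafactor_vector_step(x, grad_fn, v, t, d=1.0, eps1=1e-30, eps2=1e-3):
    # Relative step size and second-moment decay follow the proposed schedules.
    rho = min(1e-2, 1.0 / np.sqrt(t))
    beta2 = 1.0 - t ** -0.8
    # Adaptive step size: alpha_t = max(eps2, RMS(x_{t-1})) * rho_t
    alpha = max(eps2, np.sqrt(np.mean(x ** 2))) * rho
    g = grad_fn(x)
    # Running estimate of the squared gradient.
    v = beta2 * v + (1.0 - beta2) * (g ** 2 + eps1)
    u = g / np.sqrt(v)                                  # normalized gradient
    u = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)      # clip update RMS to d
    return x - alpha * u, v
```
&lt;br /&gt;
On the first step &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, so &lt;code&gt;v&lt;/code&gt; is simply the squared gradient.&lt;br /&gt;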
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
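&lt;br /&gt;
The factored matrix variant can be sketched the same way (again an illustrative NumPy version of the steps above; the function name is ours). Only the row and column statistics are carried between iterations:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    rho = min(1e-2, 1.0 / np.sqrt(t))
    beta2 = 1.0 - t ** -0.8
    alpha = max(eps2, np.sqrt(np.mean(X ** 2))) * rho
    G2 = G ** 2 + eps1
    R = beta2 * R + (1.0 - beta2) * G2.sum(axis=1)   # row statistics, shape (n,)
    C = beta2 * C + (1.0 - beta2) * G2.sum(axis=0)   # column statistics, shape (m,)
    # V_t = R_t C_t / (1_n^T R_t), formed element-wise only for this update.
    V = np.outer(R, C) / R.sum()
    U = G / np.sqrt(V)
    U = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)   # clip update RMS to d
    return X - alpha * U, R, C
```
&lt;br /&gt;
Because only &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;C&lt;/code&gt; persist, the stored state is &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt; rather than &amp;lt;math&amp;gt;O(n \times m)&amp;lt;/math&amp;gt;.&lt;br /&gt;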
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
**Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
**A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
**The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
**The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
**The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
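&lt;br /&gt;
The two schedules are easy to evaluate directly; the snippet below just applies the formulas above at a few steps:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def rho(t):
    # Relative step size: capped at 1e-2, decaying as 1/sqrt(t).
    return min(1e-2, 1.0 / np.sqrt(t))

def beta2(t):
    # Second-moment decay: 0 at t = 1, approaching 1 as t grows.
    return 1.0 - t ** -0.8

for t in (1, 100, 10**6):
    print(t, rho(t), beta2(t))
```
&lt;br /&gt;
Note that the &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; branch only becomes active once &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; exceeds &amp;lt;math&amp;gt;10^4&amp;lt;/math&amp;gt;; before that the cap &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt; applies.&lt;br /&gt;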
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of unusually large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, clipping helps ensure stable and efficient training without requiring per-parameter scaling as in Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second moment from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, instead of the full second-moment matrix&lt;br /&gt;
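&lt;br /&gt;
As a concrete illustration of the saving (using an assumed 4096-by-4096 weight matrix; the helper name is ours):&lt;br /&gt;
&lt;br /&gt;
```python
def second_moment_entries(n, m):
    # Entries kept for the second-moment state: Adam stores the full
    # n-by-m matrix; Adafactor stores a length-n and a length-m vector.
    return {"adam": n * m, "adafactor": n + m}

print(second_moment_entries(4096, 4096))
# -> {'adam': 16777216, 'adafactor': 8192}
```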
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is extremely small, it can be neglected here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; therefore reduces to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3}&amp;amp;\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267&amp;amp; 0.1867&amp;amp;0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated here as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​). Because this example uses row and column means rather than sums, this outer product differs from the factored estimate &amp;lt;math&amp;gt;R_t C_t / (1_n^T R_t)&amp;lt;/math&amp;gt; in the algorithm only by a constant factor, and that factor is cancelled by the clipping in Step 4.2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1) &amp;gt; d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U_1}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
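&lt;br /&gt;
The whole first iteration can be checked numerically. The sketch below is a NumPy transcription of Steps 1 through 5, using the row/column means and plain outer product exactly as in the worked example; hand-rounded intermediate values may differ slightly:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
d = 1.0
eps2 = 1e-3                        # eps1 = 1e-30 is negligible and omitted

rho1 = min(1e-2, 1.0 / np.sqrt(1))                     # Step 1: relative step size
alpha1 = max(eps2, np.sqrt(np.mean(X0 ** 2))) * rho1   # Step 1.2: learning rate scaling
R1 = (G1 ** 2).mean(axis=1)                            # Step 3.1: row-wise means
C1 = (G1 ** 2).mean(axis=0)                            # Step 3.2: column-wise means
V1 = np.outer(R1, C1)                                  # Step 3.3: second moment estimate
U1 = G1 / np.sqrt(V1)                                  # Step 4.1: normalized gradient
U1_hat = U1 / max(1.0, np.sqrt(np.mean(U1 ** 2)) / d)  # Step 4.2: clipped update
X1 = X0 - alpha1 * U1_hat                              # Step 5: weight update
print(np.round(X1, 3))
```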
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved 99% accuracy and F1 scores. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 scores for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.AdaFactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7026</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7026"/>
		<updated>2024-12-12T16:57:48Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* 4. Proposed Hyperparameters for Adafactor */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
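The per-vector loop above can be sketched directly in NumPy. This is a minimal illustration using the proposed schedules for &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;, not a production implementation:&lt;br /&gt;

```python
import numpy as np

def adafactor_vector(x0, grad_fn, T, eps1=1e-30, eps2=1e-3, d=1.0):
    # Minimal sketch of Adafactor for a weight vector, following the listed steps.
    x = np.array(x0, dtype=float)
    v_hat = np.zeros_like(x)
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))            # relative step size
        beta2_t = 1.0 - t ** -0.8                      # decay; zero at t = 1
        alpha_t = max(eps2, np.sqrt(np.mean(x**2))) * rho_t
        g = grad_fn(x)
        v_hat = beta2_t * v_hat + (1.0 - beta2_t) * (g**2 + eps1)
        u = g / np.sqrt(v_hat)                         # normalized gradient
        u_hat = u / max(1.0, np.sqrt(np.mean(u**2)) / d)   # clipping
        x = x - alpha_t * u_hat
    return x
```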
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
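The row-wise and column-wise moment updates in the matrix algorithm can be sketched as follows; the helper name is ours, and the factored estimate is only formed transiently:&lt;br /&gt;

```python
import numpy as np

def factored_second_moment(g, r_prev, c_prev, beta2, eps1=1e-30):
    # One update of the factored statistics R_t (row sums) and C_t (column sums).
    sq = g**2 + eps1
    r = beta2 * r_prev + (1.0 - beta2) * sq.sum(axis=1)   # shape (n,)
    c = beta2 * c_prev + (1.0 - beta2) * sq.sum(axis=0)   # shape (m,)
    # V_hat = R C / (1_n^T R), formed by broadcasting rather than stored:
    v_hat = np.outer(r, c) / r.sum()
    return r, c, v_hat
```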
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;  &lt;br /&gt;
  Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. This value is set extremely low to avoid instability in calculations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;  &lt;br /&gt;
  Helps stabilize parameter updates by controlling the scaling effect of second-moments in low-magnitude scenarios. This prevents instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;  &lt;br /&gt;
  A clipping threshold of 1 ensures stability by limiting large gradient values while maintaining sufficient learning efficiency. This avoids excessive suppression of large gradients, which could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1 / \sqrt{t})&amp;lt;/math&amp;gt;  &lt;br /&gt;
  - The &amp;lt;math&amp;gt;\min(10^{-2}, ...)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.  &lt;br /&gt;
  - The &amp;lt;math&amp;gt;1 / \sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;  &lt;br /&gt;
  - The decay factor remains close to 1 initially to allow rapid adaptation.  &lt;br /&gt;
  - The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances between rapid learning in early training and stability during later stages, ensuring smoother convergence.&lt;br /&gt;
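The two schedules above are simple functions of the iteration count; a sketch (function names are ours):&lt;br /&gt;

```python
def rho(t):
    # Relative step size: min(10^-2, 1/sqrt(t)); capped early, decaying later.
    return min(1e-2, t ** -0.5)

def beta2_hat(t):
    # Second-moment decay: 1 - t^(-0.8); starts at 0 and approaches 1.
    return 1.0 - t ** -0.8
```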
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of unusually large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, the memory cost is not &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt;, since&lt;br /&gt;
** The operation is performed element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix between steps&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, instead of the full second-moment matrix&lt;br /&gt;
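The state saving can be seen in code: only the row and column statistics persist between steps, and the normalized gradient is obtained by broadcasting without ever storing a second &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; statistic matrix (a sketch with an illustrative function name):&lt;br /&gt;

```python
import numpy as np

def normalized_update(g, r, c):
    # U_t = G_t / sqrt(V_hat_t) with V_hat_t = outer(r, c) / r.sum(),
    # computed via broadcasting; r and c are the only persistent statistics.
    return g * np.sqrt(r.sum()) / (np.sqrt(r)[:, None] * np.sqrt(c)[None, :])
```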
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second moment)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant for the step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
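This step can be checked numerically with a short NumPy snippet:&lt;br /&gt;

```python
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -1.6],
               [1.2, -0.7, 0.4]])
rms_x0 = np.sqrt(np.mean(X0**2))        # approximately 0.946
alpha_1 = max(1e-3, rms_x0) * 1e-2      # approximately 0.00946
```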
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t\in\mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, but averaging over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated here as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​). Because mean-based moments are used, this differs from the algorithm&#039;s normalized form &amp;lt;math&amp;gt;R_t C_t / 1_n^T R_t&amp;lt;/math&amp;gt; only by a constant factor, which the RMS-based clipping in Step 4 absorbs whenever clipping is active.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
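The entire first iteration can be reproduced in a few lines of NumPy, following the mean-based moments used in this example (a verification sketch, not library code):&lt;br /&gt;

```python
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])

alpha1 = max(1e-3, np.sqrt(np.mean(X0**2))) * 1e-2   # learning rate scaling
R1 = (G1**2).mean(axis=1)          # row-wise means
C1 = (G1**2).mean(axis=0)          # column-wise means
V1 = np.outer(R1, C1)              # second-moment estimate (as in the example)
U1 = G1 / np.sqrt(V1)              # normalized gradient
U1_hat = U1 / max(1.0, np.sqrt(np.mean(U1**2)))   # clipping with d = 1
X1 = X0 - alpha1 * U1_hat          # weight update
```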
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training was faster than with Adam, completing in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 scores for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7025</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7025"/>
		<updated>2024-12-12T16:56:10Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* 4. Proposed Hyperparameters for Adafactor */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
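The factored second-moment update is the only part that differs from the vector case; a minimal NumPy sketch of one &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;/&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; step (illustrative, with assumed names and an assumed 2 × 3 example gradient):

```python
import numpy as np

def factored_second_moment_step(R_prev, C_prev, G, beta2, eps1=1e-30):
    # One step of the factored update above.
    # R_prev: (n,) row statistics, C_prev: (m,) column statistics, G: (n, m).
    S = G ** 2 + eps1                                   # G_t^2 + eps1 * 1_n 1_m^T
    R = beta2 * R_prev + (1 - beta2) * S.sum(axis=1)    # row sums: (...) 1_m
    C = beta2 * C_prev + (1 - beta2) * S.sum(axis=0)    # column sums: 1_n^T (...)
    V = np.outer(R, C) / R.sum()                        # V-hat = R C / (1_n^T R)
    return R, C, V

# First iteration (beta2 = 0) on an assumed 2x3 gradient:
G = np.array([[1.0, 2.0, 2.0], [0.0, 2.0, 1.0]])
R, C, V = factored_second_moment_step(np.zeros(2), np.zeros(3), G, beta2=0.0)
```

Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; persist between steps; in a real implementation the quotient for &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is fused into the update rather than stored.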
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1 (&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
  Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates. The value is set extremely low so that it never perturbs the statistics themselves.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2 (&amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
  Stabilizes parameter updates by placing a floor on the step-size scale when parameter magnitudes are small, preventing instability caused by noise in small gradients.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold (&amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1&amp;lt;/math&amp;gt;&lt;br /&gt;
  A clipping threshold of 1 ensures stability by limiting large updates while maintaining sufficient learning efficiency, avoiding the excessive suppression of large gradients that could hinder learning.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Relative step size (&amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;\min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; term ensures convergence by reducing the step size over time, balancing exploration during early iterations with stability later in training.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;):&#039;&#039;&#039; &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The decay factor starts at 0 when &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; and approaches 1 as training proceeds, allowing rapid adaptation early on.&lt;br /&gt;
** The &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; power balances rapid learning in early training with stability during later stages, ensuring smoother convergence.&lt;br /&gt;
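The proposed schedules can be evaluated at a few steps to build intuition (an illustrative Python sketch; the sampled step counts are arbitrary):

```python
# The proposed schedules from this section, evaluated at a few steps.
rho = lambda t: min(1e-2, 1.0 / t ** 0.5)      # relative step size rho_t
beta2 = lambda t: 1.0 - t ** -0.8              # second moment decay

for t in (1, 10, 100, 20000):
    print(f"t={t}: rho={rho(t):.4f}, beta2={beta2(t):.4f}")
```

Note that the cap in &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is binding until &amp;lt;math&amp;gt;t &amp;gt; 10^4&amp;lt;/math&amp;gt;, after which the &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; decay takes over.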
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
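The clipping rule amounts to rescaling the update whenever its RMS exceeds the threshold &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;; a minimal Python sketch (illustrative only, with assumed names):

```python
import math

def clip_update(U, d=1.0):
    # Scale the update so its RMS does not exceed d; updates whose RMS is
    # already below d pass through unchanged.
    rms = math.sqrt(sum(u * u for u in U) / len(U))
    scale = max(1.0, rms / d)
    return [u / scale for u in U]

big = clip_update([3.0, -4.0])     # RMS ~3.54 > d, so it is rescaled to RMS = d
small = clip_update([0.1, -0.2])   # RMS ~0.16 < d, so it is unchanged
```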
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor computes the row and column respectively, which reduces the memory requirements from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This forms the second moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
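To make the saving concrete, here is a short calculation (the 4096 × 4096 layer size is an assumed example, not from the text):

```python
# Second-moment storage for one n x m weight matrix, counted in floats:
# an Adam-style optimizer keeps the full n x m matrix of squared-gradient
# statistics, while factored Adafactor keeps only a length-n row vector
# and a length-m column vector.
n, m = 4096, 4096
full_moment = n * m          # full matrix of statistics
factored = n + m             # Adafactor: R_t plus C_t
print(full_moment, factored, full_moment // factored)
```

For this layer the factored statistics are 2048 times smaller than the full matrix.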
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second moment)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum step-size scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
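This value can be checked directly (an illustrative Python snippet; the matrix is flattened because the RMS runs over all nine entries):

```python
import math

# Direct check of RMS(X_0) for the 3x3 matrix above, entries listed row by row.
entries = [0.7, -0.5, 0.9, -1.1, 0.8, -1.6, 1.2, -0.7, 0.4]
rms_X0 = math.sqrt(sum(x * x for x in entries) / len(entries))   # about 0.946
```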
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3}&amp;amp;\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267&amp;amp; 0.1867&amp;amp;0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In this example the second moment estimate is approximated by the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​). Because means rather than sums are used here, this differs from the algorithm&#039;s &amp;lt;math&amp;gt;R_t C_t / 1_n^T R_t&amp;lt;/math&amp;gt; normalization by a constant factor, but the structure of the computation is the same.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.809 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.809} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot      \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
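The whole first iteration can be reproduced with a short Python script that follows the example's steps, including its mean-based factored second moment (an illustrative sketch, not a reference implementation):

```python
import math

# Values from the worked example above.
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
eps2, d, rho1 = 1e-3, 1.0, min(1e-2, 1.0)

def rms(M):
    vals = [v for row in M for v in row]
    return math.sqrt(sum(v * v for v in vals) / len(vals))

alpha1 = max(eps2, rms(X0)) * rho1                             # step size
G2 = [[g * g for g in row] for row in G1]
R1 = [sum(row) / 3 for row in G2]                              # row-wise means
C1 = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]   # column-wise means
V1 = [[R1[i] * C1[j] for j in range(3)] for i in range(3)]     # outer product
U1 = [[G1[i][j] / math.sqrt(V1[i][j]) for j in range(3)] for i in range(3)]
scale = max(1.0, rms(U1) / d)                                  # clip: RMS(U1) > d
U1h = [[u / scale for u in row] for row in U1]
X1 = [[X0[i][j] - alpha1 * U1h[i][j] for j in range(3)] for i in range(3)]
```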
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor trained models that achieved 99% accuracy and F1 score, with training completing in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7024</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7024"/>
		<updated>2024-12-12T16:54:25Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Problem formulation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
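The step-size rule defined above can be sketched in Python (illustrative only; the function name and the sample vector are assumptions for demonstration):

```python
import math

def step_size(x_prev, t, eps2=1e-3):
    # alpha_t = max(eps2, RMS(x_{t-1})) * rho_t, using the proposed
    # schedule rho_t = min(1e-2, 1/sqrt(t)).
    rms_x = math.sqrt(sum(xi * xi for xi in x_prev) / len(x_prev))
    rho_t = min(1e-2, 1.0 / math.sqrt(t))
    return max(eps2, rms_x) * rho_t

alpha_1 = step_size([0.7, -0.5, 0.9], t=1)
```

Because the step size scales with the RMS of the current parameters, updates stay proportional to the parameters' own magnitude, which is what makes a hand-tuned global learning rate unnecessary.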
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so it never materializes &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** It also stores only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; instead of the full second-moment matrix&lt;br /&gt;
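A quick back-of-the-envelope comparison makes the saving concrete. The layer size below is a hypothetical example; the small &amp;lt;math&amp;gt;R&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt; vectors are just illustrative numbers:

```python
# Memory for the second-moment statistics of one hypothetical 4096 x 4096 weight matrix
n, m = 4096, 4096
full_moment = n * m        # floats Adam stores (one v_t entry per parameter)
factored = n + m           # floats Adafactor stores (R_t and C_t)
print(full_moment // factored)   # reduction factor

# Any entry of V_hat is reconstructed on the fly from R and C:
R = [0.29, 0.62, 0.29]     # illustrative row sums of a squared gradient
C = [0.38, 0.56, 0.26]     # illustrative column sums
v_00 = R[0] * C[0] / sum(R)      # V_hat[0][0], without materializing the matrix
```

For a square 4096-wide layer the factored statistics are 2048 times smaller, which is the source of Adafactor's memory advantage.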
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second moment)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
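Step 1 can be sanity-checked with a few lines of Python, computing the RMS directly from the entries of &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;:

```python
import math

# Entries of X_0, flattened row by row
X0 = [0.7, -0.5, 0.9, -1.1, 0.8, -1.6, 1.2, -0.7, 0.4]
rms_x0 = math.sqrt(sum(x * x for x in X0) / len(X0))
alpha1 = max(1e-3, rms_x0) * 1e-2          # rho_1 = 10^-2
print(round(rms_x0, 3), round(alpha1, 5))  # 0.946 0.00946
```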
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
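The row-wise and column-wise means above are one-liners to verify in Python:

```python
# Element-wise squared gradient G_1^2 from Step 2
G2 = [[0.09, 0.04, 0.16],
      [0.25, 0.36, 0.01],
      [0.04, 0.16, 0.09]]
R1 = [sum(row) / 3 for row in G2]                             # row-wise means
C1 = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]  # column-wise means
```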
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is formed from the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). For simplicity, this example omits the normalization by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt; used in the algorithm above; this changes &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; only by a uniform scale factor, which the clipping in Step 4.2 cancels, so the final update is unaffected.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
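The clipping step can be checked numerically from the entries of &amp;lt;math&amp;gt;U_1&amp;lt;/math&amp;gt;:

```python
import math

# Entries of U_1, flattened row by row
U1 = [2.711, -1.489, 4.370, -3.090, 3.055, -0.747, 1.807, -2.978, 3.278]
rms_u = math.sqrt(sum(u * u for u in U1) / len(U1))   # RMS(U_1)
scale = max(1.0, rms_u / 1.0)                          # clipping threshold d = 1
U_hat = [u / scale for u in U1]                        # clipped update vector
```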
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result for the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
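The whole first iteration can be reproduced end-to-end in a short script. This follows the simplified &amp;lt;math&amp;gt;\hat{V}_1 = R_1 \otimes C_1&amp;lt;/math&amp;gt; used in this example; because clipping rescales the update by its RMS, the resulting &amp;lt;math&amp;gt;X_1&amp;lt;/math&amp;gt; matches what the fully normalized estimate would give.

```python
import math

X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]

rms_x = math.sqrt(sum(x * x for r in X0 for x in r) / 9)
alpha = max(1e-3, rms_x) * 1e-2                       # Step 1: learning rate scaling
G2 = [[g * g for g in r] for r in G1]                 # Step 2: squared gradient
R = [sum(r) / 3 for r in G2]                          # Step 3: row means
C = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]  # column means
U = [[G1[i][j] / math.sqrt(R[i] * C[j]) for j in range(3)] for i in range(3)]
rms_u = math.sqrt(sum(u * u for r in U for u in r) / 9)      # Step 4: clipping
X1 = [[X0[i][j] - alpha * U[i][j] / max(1.0, rms_u) for j in range(3)]
      for i in range(3)]                              # Step 5: weight update
print([[round(x, 3) for x in r] for r in X1])
```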
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved 99% accuracy and F1 scores. Moreover, training time was faster than with Adam, completing in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 scores for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7023</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7023"/>
		<updated>2024-12-12T16:53:01Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: Undo revision 6932 by Fall2024 Wiki Team6 (talk)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
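This vector variant is simple enough to sketch directly in Python (a minimal illustration with plain lists, assuming the schedules &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; proposed below; the function name is illustrative):

```python
import math

def adafactor_vector_step(x, grad, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a vector parameter x; v is the running second moment."""
    n = len(x)
    beta2 = 1.0 - t ** (-0.8)                     # second-moment decay; 0 at t = 1
    rho = min(1e-2, 1.0 / math.sqrt(t))           # relative step size
    alpha = max(eps2, math.sqrt(sum(xi * xi for xi in x) / n)) * rho
    v = [beta2 * vi + (1 - beta2) * (g * g + eps1) for vi, g in zip(v, grad)]
    u = [g / math.sqrt(vi) for g, vi in zip(grad, v)]           # normalized gradient
    rms_u = math.sqrt(sum(ui * ui for ui in u) / n)
    x = [xi - alpha * ui / max(1.0, rms_u / d) for xi, ui in zip(x, u)]
    return x, v
```

At &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; the normalized gradient reduces to the sign of the gradient, so each coordinate moves by &amp;lt;math&amp;gt;\alpha_1&amp;lt;/math&amp;gt; in the descent direction.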
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== Why Clipping ===&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so it never materializes &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** It also stores only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; instead of the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second moment)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
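As a check on Step 1, the RMS and step size can be recomputed directly from the matrix above with a short NumPy sketch (illustrative; the variable names are ours, not from the article):

```python
import numpy as np

# Initial weights X_0 from the problem setup
X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -1.6],
               [1.2, -0.7, 0.4]])

eps2 = 1e-3                        # minimum floor for the step-size scale
rho1 = min(1e-2, 1 / np.sqrt(1))   # relative step size at t = 1

rms_x0 = np.sqrt(np.mean(X0 ** 2))   # RMS over all nine entries
alpha1 = max(eps2, rms_x0) * rho1    # adaptive step size alpha_1
```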
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t\in\mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t[i] = \hat{\beta}_{2t} \cdot R_{t-1}[i] + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{j=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t[j] = \hat{\beta}_{2t}\cdot C_{t-1}[j] + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{i=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is approximated by the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). Note that the algorithm above additionally divides by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; this uniform factor is omitted in this example because the clipping in Step 4.2 rescales the update by its RMS, which cancels any uniform scaling of &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; whenever &amp;lt;math&amp;gt;RMS(U_t)&amp;gt;1&amp;lt;/math&amp;gt;, as is the case here.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
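Steps 3.1–3.3 can be reproduced in a few lines of NumPy (illustrative sketch; the extra variable V1_paper applies the 1_n^T R_t normalization used in the algorithm section, which this walkthrough omits):

```python
import numpy as np

# Element-wise squared gradient G_1^2 from Step 2
G1_sq = np.array([[0.09, 0.04, 0.16],
                  [0.25, 0.36, 0.01],
                  [0.04, 0.16, 0.09]])

# With beta_hat_{2,1} = 0, the first-step moments are plain means
R1 = G1_sq.mean(axis=1)   # row moments
C1 = G1_sq.mean(axis=0)   # column moments

V1_outer = np.outer(R1, C1)       # outer product used in this walkthrough
V1_paper = V1_outer / R1.sum()    # factored estimate R C / (1^T R)
```

Both variants differ only by a uniform scale factor, which the later clipping step removes.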
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Compute the update vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1)&amp;gt;d&amp;lt;/math&amp;gt;, scale &amp;lt;math&amp;gt;U_1&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_1 =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
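Steps 4.1–4.2 (normalization and clipping) can be sketched in NumPy as well (illustrative, recomputing the factored moments from the gradient above):

```python
import numpy as np

G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])

# Factored second moment from Step 3 (outer product of row/column means)
V1 = np.outer(np.mean(G1**2, axis=1), np.mean(G1**2, axis=0))

d = 1.0                               # clipping threshold
U1 = G1 / np.sqrt(V1)                 # normalized gradient
rms_u1 = np.sqrt(np.mean(U1 ** 2))    # RMS of the raw update
U1_hat = U1 / max(1.0, rms_u1 / d)    # clipped update
```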
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
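Putting the five steps together, the whole first iteration fits in a dozen lines (an illustrative sketch of this walkthrough, not a production implementation):

```python
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
eps2, d, rho1 = 1e-3, 1.0, 1e-2

alpha1 = max(eps2, np.sqrt(np.mean(X0**2))) * rho1    # Step 1: step size
R1 = np.mean(G1**2, axis=1)                           # Step 3.1: row moments
C1 = np.mean(G1**2, axis=0)                           # Step 3.2: column moments
V1 = np.outer(R1, C1)                                 # Step 3.3: factored estimate
U1 = G1 / np.sqrt(V1)                                 # Step 4.1: normalized gradient
U1_hat = U1 / max(1.0, np.sqrt(np.mean(U1**2)) / d)   # Step 4.2: clipping
X1 = X0 - alpha1 * U1_hat                             # Step 5: weight update
```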
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models such as LLaMA, in combination with novel preconditioned diagonalization methods that significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor trained models that achieved approximately 99% accuracy and F1 score, and training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7022</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=7022"/>
		<updated>2024-12-12T16:49:42Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: Undo revision 6933 by Fall2024 Wiki Team6 (talk)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
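The loop body above translates almost line for line into NumPy. The sketch below is illustrative (the quadratic objective and the function name are ours, not from the article):

```python
import numpy as np

def adafactor_vector_step(x, grad, v_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector x (the loop body above)."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                      # relative step size
    beta2t = 1.0 - t ** -0.8                                 # second moment decay
    alpha_t = max(eps2, np.sqrt(np.mean(x ** 2))) * rho_t    # adaptive step size
    v = beta2t * v_prev + (1 - beta2t) * (grad ** 2 + eps1)  # second moment EMA
    u = grad / np.sqrt(v)                                    # normalized gradient
    u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)       # clipping
    return x - alpha_t * u_hat, v

# Toy run: minimize f(x) = ||x||^2 / 2, whose gradient is x itself
x = np.array([1.0, -2.0, 3.0])
v = np.zeros_like(x)
for t in range(1, 101):
    x, v = adafactor_vector_step(x, x.copy(), v, t)
```

Note that only the running second moment v carries over between iterations.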
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
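For matrices, the only structural change is the factored second moment. The sketch below (again illustrative) stores just the n + m moment entries instead of the full n × m matrix, run here on the X_0 and G_1 from the numerical example:

```python
import numpy as np

def adafactor_matrix_step(X, G, R_prev, C_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight matrix X, storing only row/column moments."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    beta2t = 1.0 - t ** -0.8
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t
    G2 = G ** 2 + eps1
    R = beta2t * R_prev + (1 - beta2t) * G2.sum(axis=1)  # n row statistics
    C = beta2t * C_prev + (1 - beta2t) * G2.sum(axis=0)  # m column statistics
    V = np.outer(R, C) / R.sum()                         # factored R C / (1^T R)
    U = G / np.sqrt(V)
    U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)
    return X - alpha_t * U_hat, R, C

# One step on the matrices from the numerical example
X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
X1, R1, C1 = adafactor_matrix_step(X0, G1, np.zeros(3), np.zeros(3), t=1)
```

Only R and C persist between iterations; V is formed transiently, and even that materialization can be avoided by applying the factors element-wise.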
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor computes the row and column respectively, which reduces the memory requirements from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This updates the second moment estimate based on the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not cost &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory since&lt;br /&gt;
** The operation is performed element-wise, so it never materializes &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** It also stores only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; instead of the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant added to the squared gradient)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning-rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t\in\mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t[i] = \hat{\beta}_{2t} \cdot R_{t-1}[i] + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{j=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, except that the mean is taken over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{i=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3} &amp;amp; \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267 &amp;amp; 0.1867 &amp;amp; 0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} = \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-1.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
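The factored statistics in Steps 2–4 can be reproduced with a short NumPy sketch (an illustrative check of this example only, using its simplification of row/column means with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;; the variable names are ours, not from any library):&lt;br /&gt;

```python
import numpy as np

# Gradient from the example above.
G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])

G2 = G1 ** 2                       # element-wise squared gradient (Step 2)
R1 = G2.mean(axis=1)               # row-wise means (Step 3.1)
C1 = G2.mean(axis=0)               # column-wise means (Step 3.2)
V1 = np.outer(R1, C1)              # factored second-moment estimate (Step 3.3)
U1 = G1 / np.sqrt(V1)              # normalized gradient (Step 4.1)
rms = np.sqrt((U1 ** 2).mean())    # RMS of the update
U1_hat = U1 / max(1.0, rms / 1.0)  # clip so RMS does not exceed d = 1 (Step 4.2)
```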
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved 99% accuracy and F1 score. Training was also faster than with Adam, completing in approximately 43 minutes. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6988</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6988"/>
		<updated>2024-12-12T01:07:02Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Applications */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
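One iteration of the vector-case pseudocode above might be sketched as follows (a minimal illustration; the function name and call signature are ours, not from any library):&lt;br /&gt;

```python
import numpy as np

def adafactor_vector_step(x, g, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a vector parameter x with gradient g."""
    rho = min(1e-2, 1.0 / np.sqrt(t))                   # relative step size
    beta = 1.0 - t ** -0.8                              # second-moment decay
    alpha = max(eps2, np.sqrt((x ** 2).mean())) * rho   # adaptive step size
    v = beta * v + (1 - beta) * (g ** 2 + eps1)         # second-moment estimate
    u = g / np.sqrt(v)                                  # normalized gradient
    u_hat = u / max(1.0, np.sqrt((u ** 2).mean()) / d)  # RMS clipping
    return x - alpha * u_hat, v
```

In the vector case the full second-moment vector &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is stored; the memory saving from factorization applies only to the matrix case below.&lt;br /&gt;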
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
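The matrix-case pseudocode can likewise be sketched in NumPy (illustrative only; the names are ours). Note that only the row vector &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and column vector &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are carried between steps, which is the source of the memory saving:&lt;br /&gt;

```python
import numpy as np

def adafactor_matrix_step(X, grad_fn, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a matrix parameter X; grad_fn returns the gradient at X."""
    rho = min(1e-2, 1.0 / np.sqrt(t))                   # relative step size
    beta = 1.0 - t ** -0.8                              # second-moment decay
    alpha = max(eps2, np.sqrt((X ** 2).mean())) * rho   # adaptive step size
    G = grad_fn(X)
    G2 = G ** 2 + eps1
    R = beta * R + (1 - beta) * G2.sum(axis=1)          # row-wise second moment
    C = beta * C + (1 - beta) * G2.sum(axis=0)          # column-wise second moment
    V = np.outer(R, C) / R.sum()                        # factored estimate R C / (1^T R)
    U = G / np.sqrt(V)                                  # normalized gradient
    U_hat = U / max(1.0, np.sqrt((U ** 2).mean()) / d)  # RMS clipping
    return X - alpha * U_hat, R, C
```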
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
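The two schedules above depend only on the step counter and can be written directly (a small illustration):&lt;br /&gt;

```python
# Proposed Adafactor schedules as functions of the step t (t >= 1).
def rho(t):
    """Relative step size: min(10^-2, 1/sqrt(t))."""
    return min(1e-2, t ** -0.5)

def beta_hat2(t):
    """Second-moment decay: 1 - t^-0.8; starts at 0 and approaches 1."""
    return 1.0 - t ** -0.8
```

Because of the &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt; cap, the relative step size stays constant until &amp;lt;math&amp;gt;t &amp;gt; 10^4&amp;lt;/math&amp;gt;, while &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; begins at 0 (no history at &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt;).&lt;br /&gt;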
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning-rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{j=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; therefore reduces to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, except that the mean is taken over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{i=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3} &amp;amp; \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267 &amp;amp; 0.1867 &amp;amp; 0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} = \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha \cdot      \hat{U_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result for first iteration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Multilingual Model Training&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In training multilingual models, Adafactor improved scalability and efficiency, particularly by significantly reducing memory consumption when handling large-scale parameters.&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Pretraining Vision Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
When training ResNet50 and ViT on the ImageNet1k dataset, Adafactor successfully optimized these deep networks with its low memory requirements. Additionally, with new algorithms combining preconditioned diagonalization methods (e.g., AdafacDiag and AdafacDiag++), it outperformed the standard Adam optimizer in both convergence speed and final accuracy.&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Software Tools and Platforms&#039;&#039;&#039; ===&lt;br /&gt;
Adafactor has been integrated into the following mainstream deep learning frameworks, making it accessible to developers:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TensorFlow&#039;&#039;&#039;: Provides a built-in implementation of Adafactor, supporting T5 model optimization.&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PyTorch:&#039;&#039;&#039; PyTorch provides the Adafactor optimizer through the torch.optim.Adafactor class.&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JAX/Flax:&#039;&#039;&#039; JAX provides an optimizer library called Optax, which includes the Adafactor optimizer.&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Future Prospects&#039;&#039;&#039; ===&lt;br /&gt;
As the scale of deep learning models continues to grow, Adafactor’s memory-saving and computational efficiency advantages will become increasingly important. In the training of ultra-large-scale models (e.g., GPT and Vision Transformers), Adafactor is expected to become an indispensable optimization tool. Furthermore, by combining with other optimization strategies, such as mixed precision training, Adafactor may further enhance its applicability in both industrial and research settings.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6987</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6987"/>
		<updated>2024-12-12T01:02:47Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
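The loop above can be sketched in Python as follows. This is a minimal illustration, not a library implementation: the function and variable names are illustrative, NumPy is assumed, and the caller supplies the gradient.

```python
# Minimal sketch of one Adafactor iteration for a vector parameter.
# Names (adafactor_vector_step, grad, v) are illustrative, not from a library.
import numpy as np

def adafactor_vector_step(x, grad, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """Apply one Adafactor update to the vector x; returns (new_x, new_v)."""
    rms = lambda a: np.sqrt(np.mean(a ** 2))
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                     # relative step size
    alpha_t = max(eps2, rms(x)) * rho_t                     # adaptive step size
    beta2_t = 1.0 - t ** (-0.8)                             # decay; beta2_1 = 0
    v = beta2_t * v + (1.0 - beta2_t) * (grad ** 2 + eps1)  # second moment estimate
    u = grad / np.sqrt(v)                                   # normalized gradient
    u_hat = u / max(1.0, rms(u) / d)                        # update clipping
    return x - alpha_t * u_hat, v
```

For example, iterating this step on the quadratic loss &lt;math&gt;f(x) = \lVert x \rVert^2&lt;/math&gt; (gradient 2x) steadily shrinks x without any manually chosen global learning rate.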
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
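A corresponding sketch of the factored (matrix) variant is below; again a minimal illustration with illustrative names, assuming NumPy. The key point is that only the row statistics R (n values) and column statistics C (m values) are carried between iterations, rather than a full n-by-m second-moment matrix.

```python
# Sketch of one factored Adafactor iteration for an n-by-m matrix parameter.
# Only row moments R (length n) and column moments C (length m) are stored.
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """Apply one factored Adafactor update; returns (new_X, new_R, new_C)."""
    rms = lambda A: np.sqrt(np.mean(A ** 2))
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
    alpha_t = max(eps2, rms(X)) * rho_t                  # adaptive step size
    beta2_t = 1.0 - t ** (-0.8)                          # decay; beta2_1 = 0
    G2 = G ** 2 + eps1
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)   # row-wise second moment
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)   # column-wise second moment
    V = np.outer(R, C) / R.sum()                         # rank-1 factored estimate
    U = G / np.sqrt(V)                                   # normalized gradient
    U_hat = U / max(1.0, rms(U) / d)                     # update clipping
    return X - alpha_t * U_hat, R, C
```

Storing R and C requires n + m numbers per weight matrix instead of the n &times; m required by Adam's full second-moment accumulator, which is the source of Adafactor's memory savings.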
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
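These proposed schedules are easy to check directly: the relative step size stays capped at 10&lt;sup&gt;-2&lt;/sup&gt; until t = 10&lt;sup&gt;4&lt;/sup&gt;, after which the 1/&radic;t term takes over, while the decay parameter starts at 0 and approaches 1. A minimal check (illustrative helper names):

```python
# Quick check of the proposed Adafactor schedules.
import math

def rho(t):
    """Relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / math.sqrt(t))

def beta2(t):
    """Second moment decay: 1 - t^(-0.8); equals 0 at t = 1."""
    return 1.0 - t ** (-0.8)

# rho is flat at 0.01 through t = 10^4, then decays; beta2 rises toward 1.
schedule = [(t, rho(t), beta2(t)) for t in (1, 100, 10_000, 1_000_000)]
```

Because &lt;math&gt;\hat{\beta}_{21} = 0&lt;/math&gt;, the first iteration ignores the (zero-initialized) accumulator entirely, which is why Adafactor needs no bias correction.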
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant added to the squared gradient)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&amp;lt;math&amp;gt;\alpha_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2, RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001, 0.806)\cdot 0.01 = 0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) as an exponential moving average of the past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of the squared gradients (&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;), with the balance controlled by &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is therefore:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is taken as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). The full algorithm additionally divides by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; this example keeps the plain outer product, which rescales &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; only by a constant factor that cancels when the update is clipped by its RMS in Step 4.2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; element-wise by the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector (&#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;) to ensure its RMS value does not exceed the predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1,\tfrac{RMS(U_t)}{d})} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9} \sum_{i=1}^9 U_1[i]^2} \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;) &amp;gt; d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_1 = \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix} - 0.00806 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
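The whole first iteration above can be reproduced numerically. The sketch below follows the example's steps, assuming NumPy, using &minus;0.6 in the (2,3) entry of X&lt;sub&gt;0&lt;/sub&gt; (the value the RMS and weight-update numbers imply) and the plain outer product for the second moment, as in Step 3.3.

```python
# Reproduce the first Adafactor iteration from the worked example.
import numpy as np

X0 = np.array([[ 0.7, -0.5,  0.9],
               [-1.1,  0.8, -0.6],   # -0.6 here matches RMS(X0) ~ 0.806
               [ 1.2, -0.7,  0.4]])
G1 = np.array([[ 0.3, -0.2,  0.4],
               [-0.5,  0.6, -0.1],
               [ 0.2, -0.4,  0.3]])
eps2, d, rho1 = 1e-3, 1.0, 1e-2

rms = lambda M: np.sqrt(np.mean(M ** 2))
alpha1 = max(eps2, rms(X0)) * rho1         # Step 1: learning rate scaling
R1 = (G1 ** 2).mean(axis=1)                # Step 3.1: row moments
C1 = (G1 ** 2).mean(axis=0)                # Step 3.2: column moments
V1 = np.outer(R1, C1)                      # Step 3.3: outer product
U1 = G1 / np.sqrt(V1)                      # Step 4.1: normalized gradient
U1_hat = U1 / max(1.0, rms(U1) / d)        # Step 4.2: clipping
X1 = X0 - alpha1 * U1_hat                  # Step 5: weight update
```

Running this reproduces the intermediate values of the example to the displayed precision, which is a quick way to sanity-check each step.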
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Adafactor is an efficient adaptive optimizer designed specifically for large-scale deep learning tasks. Its unique memory-saving properties have made it widely used for training large-scale language models, image recognition models, and reinforcement learning policy networks. Compared to other optimizers (e.g., Adam), Adafactor delivers exceptional performance in large-scale computations while significantly reducing memory requirements. Below are several specific application scenarios of Adafactor:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Natural Language Processing (NLP)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In NLP tasks, Adafactor has been successfully applied to training ultra-large-scale language models, such as Google’s Transformer and T5 (Text-To-Text Transfer Transformer). By significantly reducing memory usage during the gradient update process, Adafactor enables efficient model training in resource-constrained environments. For example, the T5 model in Google’s research employed Adafactor to effectively train on large datasets through text-to-text conversion tasks.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Training Large-Scale Language Models&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been used to train large-scale language models like LLaMA, combining it with novel preconditioned diagonalization methods to significantly enhance training efficiency. Experiments showed that Adafactor achieved performance comparable to the Adam optimizer while consuming substantially less memory and computational resources.&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Humor Detection Tasks&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adafactor has been utilized to optimize ALBERT-based models for humor detection tasks. Configured as an adaptive learning rate optimizer and paired with a cross-entropy loss function, Adafactor was used to train models that achieved accuracy and F1 scores of 99%. Moreover, training completed in approximately 43 minutes, faster than with Adam. Comparisons with the Adam and AdaBound optimizers demonstrated that Adafactor excelled in both time efficiency and performance, especially in accuracy, recall, and F1 score for humor detection tasks.&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6985</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6985"/>
		<updated>2024-12-12T00:51:19Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern from Google Research in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike traditional Adam optimizers, Adafactor does not store complete second-order moment matrices. Instead, it employs a factorization approach that only maintains gradient statistics for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, allowing it to dynamically adjust step sizes without the need for manually setting a global learning rate or relying heavily on hyperparameter tuning. Its design also defaults to not performing bias correction, yet it remains stable in scenarios involving large-batch training data.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP).&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
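The vector algorithm above can be sketched in a few lines of NumPy (an illustrative sketch only, not the reference implementation; the function name is our own and the defaults follow the proposed hyperparameters below):&lt;br /&gt;

```python
import numpy as np

def adafactor_vector_step(x, grad, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector x (sketch of the steps above)."""
    rho = min(1e-2, 1.0 / np.sqrt(t))                 # relative step size
    alpha = max(eps2, np.sqrt(np.mean(x**2))) * rho   # adaptive step size
    beta2 = 1.0 - t**-0.8                             # second-moment decay (0 at t = 1)
    v = beta2 * v + (1.0 - beta2) * (grad**2 + eps1)  # running second moment
    u = grad / np.sqrt(v)                             # normalized gradient
    u_hat = u / max(1.0, np.sqrt(np.mean(u**2)) / d)  # RMS clipping
    return x - alpha * u_hat, v
```

Note that only the vector &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; (here &lt;code&gt;v&lt;/code&gt;) is carried between steps, which is the memory saving relative to Adam.&lt;br /&gt;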
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
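A corresponding sketch of the factored matrix update, using the sum-based row and column statistics of the algorithm above (again illustrative, assuming NumPy):&lt;br /&gt;

```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update for a weight matrix X."""
    rho = min(1e-2, 1.0 / np.sqrt(t))                 # relative step size
    alpha = max(eps2, np.sqrt(np.mean(X**2))) * rho   # adaptive step size
    beta2 = 1.0 - t**-0.8                             # second-moment decay (0 at t = 1)
    G2 = G**2 + eps1
    R = beta2 * R + (1.0 - beta2) * G2.sum(axis=1)    # row statistics: (G^2 + eps1) 1_m
    C = beta2 * C + (1.0 - beta2) * G2.sum(axis=0)    # column statistics: 1_n^T (G^2 + eps1)
    V = np.outer(R, C) / R.sum()                      # rank-1 estimate R C / (1_n^T R)
    U = G / np.sqrt(V)                                # normalized gradient
    U_hat = U / max(1.0, np.sqrt(np.mean(U**2)) / d)  # RMS clipping
    return X - alpha * U_hat, R, C
```

For an &amp;lt;math&amp;gt;n \times m&amp;lt;/math&amp;gt; matrix, only the &amp;lt;math&amp;gt;n + m&amp;lt;/math&amp;gt; entries of &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;C&lt;/code&gt; persist between steps, instead of the &amp;lt;math&amp;gt;nm&amp;lt;/math&amp;gt; entries Adam would store.&lt;br /&gt;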
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
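The two schedules can be written directly as functions of the step counter (a small illustrative helper):&lt;br /&gt;

```python
def rho(t):
    """Relative step size: min(10^-2, 1/sqrt(t))."""
    return min(1e-2, t ** -0.5)

def beta2_hat(t):
    """Second-moment decay: 1 - t^-0.8 (equals 0 at t = 1, approaches 1 as t grows)."""
    return 1.0 - t ** -0.8
```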
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum parameter-scale constant for the learning rate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.806)\cdot 0.01=0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
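Step 1 can be checked numerically (NumPy; note that the arithmetic of this example uses &amp;lt;math&amp;gt;-0.6&amp;lt;/math&amp;gt; for the row-2, column-3 entry of &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;):&lt;br /&gt;

```python
import numpy as np

# Initial weights as used in this example's arithmetic
X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -0.6],
               [1.2, -0.7, 0.4]])

rms_x0 = np.sqrt(np.mean(X0**2))      # RMS(X_0) = sqrt(5.85 / 9)
alpha1 = max(1e-3, rms_x0) * 1e-2     # alpha_1 = max(eps_2, RMS(X_0)) * rho_1
print(round(rms_x0, 3), round(alpha1, 5))  # → 0.806 0.00806
```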
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, but averaging over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). Note that the full algorithm additionally divides by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; this simplified example omits that normalization.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
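The factored second-moment numbers above can be reproduced in NumPy, following this example's convention (row and column means, outer product without further normalization):&lt;br /&gt;

```python
import numpy as np

G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])

R1 = (G1**2).mean(axis=1)   # row-wise means of the squared gradients
C1 = (G1**2).mean(axis=0)   # column-wise means of the squared gradients
V1 = np.outer(R1, C1)       # second-moment estimate used in this example
```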
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The update vector is computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; element-wise by the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1)   = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_t[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since RMS(&#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​)&amp;gt;d, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_1 =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
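Steps 4.1 and 4.2 can be verified numerically (the gradient and second-moment matrices are recomputed here so the snippet is self-contained):&lt;br /&gt;

```python
import numpy as np

G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])
# Second-moment estimate, following this example's (unnormalized) convention
V1 = np.outer((G1**2).mean(axis=1), (G1**2).mean(axis=0))

U1 = G1 / np.sqrt(V1)                 # Step 4.1: update vector
rms_u1 = np.sqrt(np.mean(U1**2))      # RMS(U_1), approximately 2.808
U1_hat = U1 / max(1.0, rms_u1 / 1.0)  # Step 4.2: clipped update (d = 1)
```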
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-0.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
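All five steps of the first iteration can be combined into one short script (a self-contained check of the worked example; the &amp;lt;math&amp;gt;-0.6&amp;lt;/math&amp;gt; entry of &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt; matches the arithmetic used above):&lt;br /&gt;

```python
import numpy as np

X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -0.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])

alpha1 = max(1e-3, np.sqrt(np.mean(X0**2))) * 1e-2   # Step 1: learning rate scaling
V1 = np.outer((G1**2).mean(axis=1),                  # Steps 2-3: factored second moment
              (G1**2).mean(axis=0))
U1 = G1 / np.sqrt(V1)                                # Step 4.1: update vector
U1_hat = U1 / max(1.0, np.sqrt(np.mean(U1**2)))      # Step 4.2: RMS clipping (d = 1)
X1 = X0 - alpha1 * U1_hat                            # Step 5: weight update
```

Rounding &lt;code&gt;X1&lt;/code&gt; to three decimals reproduces the matrix above.&lt;br /&gt;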
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6984</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6984"/>
		<updated>2024-12-12T00:49:58Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Adafactor is an efficient, adaptive learning rate optimization algorithm proposed by Noam Shazeer and Mitchell Stern from Google Research in 2018. &amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike the traditional Adam optimizer, Adafactor does not store the complete second-order moment matrix. Instead, it employs a factorization approach that maintains gradient statistics only for the rows and columns of parameter matrices, significantly reducing memory usage. Moreover, Adafactor uses an adaptive learning rate, dynamically adjusting step sizes without a manually set global learning rate or heavy hyperparameter tuning. By default it performs no bias correction, yet it remains stable in large-batch training scenarios.&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; This efficiency makes it an ideal choice for training ultra-large-scale models such as T5.&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Adafactor’s efficient memory usage and outstanding performance make it widely applicable in scenarios such as Natural Language Processing (NLP). Compared to the Adam optimizer, Adafactor significantly reduces memory and computational resource requirements while maintaining comparable performance when training large-scale language models and vision models. &amp;lt;sup&amp;gt;3,6&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -0.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second-moment estimate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum parameter-scale constant for the learning rate)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-0.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{5.85}{9}}\approx 0.806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.806)\cdot 0.01=0.00806&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t=\mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{j=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for first iteration: &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is too small, we can ignore it. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, except that the mean is taken over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{i=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
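As a sanity check, the row means, column means, and their outer product above can be recomputed in a few lines of Python (a verification sketch; variable names are ours, and values match to the rounding shown):

```python
G1 = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
G2 = [[g * g for g in row] for row in G1]            # element-wise square
R1 = [sum(row) / 3 for row in G2]                    # row-wise means
C1 = [sum(col) / 3 for col in zip(*G2)]              # column-wise means
V1 = [[r * c for c in C1] for r in R1]               # outer product of R1 and C1
```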
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1) &amp;gt; d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result for first iteration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-1.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
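The Step 5 update can be verified numerically from the quantities above, with &amp;lt;math&amp;gt;\alpha_1 = 0.00806&amp;lt;/math&amp;gt; and the clipped update matrix as given (a check sketch; variable names are ours):

```python
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
U_hat = [[0.965, -0.53, 1.556], [-1.1, 1.088, -0.266], [0.664, -1.06, 1.167]]
alpha1 = 0.00806

# X1 = X0 - alpha1 * U_hat, element-wise
X1 = [[x - alpha1 * u for x, u in zip(xr, ur)] for xr, ur in zip(X0, U_hat)]
```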
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6974</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6974"/>
		<updated>2024-12-11T22:02:12Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
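For concreteness, the update direction and its RMS defined above can be written as two short Python helpers (a sketch; the function names are ours, not from the original algorithm):

```python
import math

def update_direction(grads, v_hats):
    """u_x = -g_x / sqrt(v_x) for each coordinate."""
    return [-g / math.sqrt(v) for g, v in zip(grads, v_hats)]

def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))
```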
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
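The vector algorithm above can be sketched in a few lines of Python (the function name and plain-list representation are ours; this is an illustration under the stated schedules, not a reference implementation):

```python
import math

def adafactor_vector_step(x, grad, v_hat, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector, following the steps above."""
    n = len(x)
    rho = min(1e-2, 1.0 / math.sqrt(t))          # relative step size
    beta2 = 1.0 - t ** -0.8                      # second-moment decay (0 at t = 1)
    rms_x = math.sqrt(sum(w * w for w in x) / n)
    alpha = max(eps2, rms_x) * rho               # adaptive step size
    v_hat = [beta2 * v + (1 - beta2) * (g * g + eps1)
             for v, g in zip(v_hat, grad)]       # second moment estimate
    u = [g / math.sqrt(v) for g, v in zip(grad, v_hat)]
    rms_u = math.sqrt(sum(ui * ui for ui in u) / n)
    scale = 1.0 / max(1.0, rms_u / d)            # clipping
    return [w - alpha * ui * scale for w, ui in zip(x, u)], v_hat
```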
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
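The factored (matrix) variant above can be sketched similarly; note that it keeps only row and column statistics and normalizes their product by &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;, exactly as in the listed update (function name and list representation are ours):

```python
import math

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for an n-by-m weight matrix with factored moments."""
    n, m = len(X), len(X[0])
    rho = min(1e-2, 1.0 / math.sqrt(t))
    beta2 = 1.0 - t ** -0.8
    rms_x = math.sqrt(sum(w * w for row in X for w in row) / (n * m))
    alpha = max(eps2, rms_x) * rho
    G2 = [[g * g + eps1 for g in row] for row in G]
    R = [beta2 * R[i] + (1 - beta2) * sum(G2[i]) for i in range(n)]   # row sums
    C = [beta2 * C[j] + (1 - beta2) * sum(G2[i][j] for i in range(n))
         for j in range(m)]                                           # column sums
    total = sum(R)                                                    # 1_n^T R_t
    V = [[R[i] * C[j] / total for j in range(m)] for i in range(n)]   # rank-1 estimate
    U = [[G[i][j] / math.sqrt(V[i][j]) for j in range(m)] for i in range(n)]
    rms_u = math.sqrt(sum(u * u for row in U for u in row) / (n * m))
    scale = 1.0 / max(1.0, rms_u / d)
    return [[X[i][j] - alpha * U[i][j] * scale for j in range(m)]
            for i in range(n)], R, C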
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
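The two proposed schedules can be expressed directly (function names are ours):

```python
def rho(t):
    """Relative step size schedule: min(10^-2, 1/sqrt(t))."""
    return min(1e-2, t ** -0.5)

def beta2_hat(t):
    """Second-moment decay: 1 - t^-0.8; equals 0 at t = 1 and approaches 1."""
    return 1.0 - t ** -0.8
```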
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant added to the squared gradient)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Regularization constant; lower bound on the RMS used in the step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{j=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for first iteration: &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is too small, we can ignore it. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, except that the mean is taken over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{{t-1}} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{i=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​ element-wise with the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V_t}+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector ( &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;​ ) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1) &amp;gt; d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_1} =   \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
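The clipping step can be reproduced directly from &amp;lt;math&amp;gt;U_1&amp;lt;/math&amp;gt;: the RMS is taken over all nine entries, and dividing by it recovers the clipped matrix shown above (a check sketch; variable names are ours):

```python
import math

U1 = [[2.711, -1.489, 4.370], [-3.090, 3.055, -0.747], [1.807, -2.978, 3.278]]
flat = [u for row in U1 for u in row]
rms_u = math.sqrt(sum(u * u for u in flat) / len(flat))
d = 1.0
# RMS(U1) exceeds the clipping threshold d, so the scale is 1 / RMS(U1)
scale = 1.0 / max(1.0, rms_u / d)
U1_hat = [[u * scale for u in row] for row in U1]
```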
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result for first iteration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot \begin{bmatrix} 0.965&amp;amp;-0.53&amp;amp;1.556 \\-1.1&amp;amp;1.088&amp;amp;-0.266\\0.664&amp;amp;-1.06&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6972</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6972"/>
		<updated>2024-12-11T21:57:54Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
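As a minimal sketch, the vector loop above maps directly onto NumPy. The function name <code>adafactor_vector_step</code> and the convention of passing the running moment <code>v</code> in and out are our own choices, not part of the published algorithm.

```python
import numpy as np

def adafactor_vector_step(x, grad, t, v=None,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a weight vector, following the steps above."""
    rho = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size rho_t
    alpha = max(eps2, np.sqrt(np.mean(x**2))) * rho    # alpha_t = max(eps2, RMS(x)) * rho_t
    beta2 = 1.0 - t**-0.8                              # decay; equals 0 at t = 1
    v = beta2 * (v if v is not None else 0.0) + (1.0 - beta2) * (grad**2 + eps1)
    u = grad / np.sqrt(v)                              # normalized gradient
    u_hat = u / max(1.0, np.sqrt(np.mean(u**2)) / d)   # RMS clipping
    return x - alpha * u_hat, v
```

Calling it repeatedly with the returned `v` carries the second-moment state across iterations.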
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
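The matrix loop above can be sketched the same way; note that only the row vector <code>R</code> and column vector <code>C</code> are stored, never a full second-moment matrix. The function name and state-passing convention are assumptions of this sketch.

```python
import numpy as np

def adafactor_matrix_step(X, grad, t, R=None, C=None,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One factored Adafactor update for a weight matrix (sketch of the loop above)."""
    rho = min(1e-2, 1.0 / np.sqrt(t))
    alpha = max(eps2, np.sqrt(np.mean(X**2))) * rho
    beta2 = 1.0 - t**-0.8
    sq = grad**2 + eps1
    # Only n + m statistics are kept, not the n x m second-moment matrix.
    R = beta2 * (R if R is not None else 0.0) + (1.0 - beta2) * sq.sum(axis=1)
    C = beta2 * (C if C is not None else 0.0) + (1.0 - beta2) * sq.sum(axis=0)
    V = np.outer(R, C) / R.sum()                       # factored second moment
    U = grad / np.sqrt(V)
    U_hat = U / max(1.0, np.sqrt(np.mean(U**2)) / d)   # RMS clipping
    return X - alpha * U_hat, R, C
```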
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
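The two proposed schedules can be written directly as functions (a sketch; the names `rho` and `beta2_hat` are ours):

```python
def rho(t):
    """Relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, t ** -0.5)

def beta2_hat(t):
    """Second moment decay: 1 - t^(-0.8); equals 0 at t = 1, so no bias correction is needed."""
    return 1.0 - t ** -0.8
```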
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient for first iteration (​&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_1 = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Second-moment regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning-rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
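Step 1 can be checked numerically with a few lines of NumPy (variable names are ours):

```python
import numpy as np

# Step 1: relative step size and RMS-based learning rate scaling.
X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -1.6],
               [1.2, -0.7, 0.4]])
rho_1 = min(1e-2, 1.0 / np.sqrt(1))   # relative step size at t = 1
rms_X0 = np.sqrt(np.mean(X0 ** 2))    # RMS over all 9 entries
alpha_1 = max(1e-3, rms_X0) * rho_1   # adaptive step size
```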
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_1 = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{m}\sum_{j=1}^m G^{2}_1[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, but averaging over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3} &amp;amp; \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267&amp;amp; 0.1867&amp;amp;0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
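Steps 2 through 3.2 amount to squaring the gradient element-wise and taking row and column means, which is easy to verify (a sketch; variable names are ours):

```python
import numpy as np

# Steps 2-3.2: squared gradients and their row/column means.
G1 = np.array([[0.3, -0.2, 0.4],
               [-0.5, 0.6, -0.1],
               [0.2, -0.4, 0.3]])
G1_sq = G1 ** 2          # element-wise square
R1 = G1_sq.mean(axis=1)  # row-wise means
C1 = G1_sq.mean(axis=0)  # column-wise means
```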
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). (The algorithm in Section 3 additionally divides by the normalizer &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; this constant rescaling of &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;, like the use of means instead of sums for &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;, is omitted here because it cancels in the clipping step whenever &amp;lt;math&amp;gt;\text{RMS}(U_t) &amp;gt; d&amp;lt;/math&amp;gt;.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
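The outer product of Step 3.3 can be reproduced as follows (a sketch using the example's unnormalized outer product; any constant rescaling of the second moment cancels later in the clipping step):

```python
import numpy as np

# Step 3.3: factored second moment as the outer product R1 x C1.
R1 = np.array([0.09 + 0.04 + 0.16, 0.25 + 0.36 + 0.01, 0.04 + 0.16 + 0.09]) / 3
C1 = np.array([0.09 + 0.25 + 0.04, 0.04 + 0.36 + 0.16, 0.16 + 0.01 + 0.09]) / 3
V1 = np.outer(R1, C1)
```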
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t&amp;lt;/math&amp;gt; is computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; element-wise by the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&#039;&#039;&#039; and &amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Scale the update vector (&#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;) to ensure its RMS value does not exceed a predefined clipping threshold (&amp;lt;math&amp;gt;d &amp;lt;/math&amp;gt;), maintaining stability in updates.&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1,\tfrac{RMS(U_t)}{d})} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}\sum_{i=1}^9 U_1[i]^2} \approx 2.808 &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1)&amp;gt;d&amp;lt;/math&amp;gt;, scale &amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{U}_1 = \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.643&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
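Step 4 can be verified end-to-end from the gradient alone (a sketch; variable names are ours):

```python
import numpy as np

# Step 4: normalized gradient, its RMS, and the clipped update (d = 1).
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
R1 = (G1 ** 2).mean(axis=1)            # row means of squared gradients
C1 = (G1 ** 2).mean(axis=0)            # column means of squared gradients
V1 = np.outer(R1, C1)                  # factored second moment (unnormalized)
U1 = G1 / np.sqrt(V1)                  # normalized gradient
rms_U1 = np.sqrt(np.mean(U1 ** 2))
U1_hat = U1 / max(1.0, rms_U1 / 1.0)   # clip so RMS does not exceed d = 1
```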
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Adjust the weights (&amp;lt;math&amp;gt;X_t &amp;lt;/math&amp;gt;) by subtracting the product of the learning rate (&amp;lt;math&amp;gt;\alpha_t &amp;lt;/math&amp;gt;) and the clipped update vector (&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt; ).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.643&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
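The whole first iteration can be assembled into one short script from only &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt; (a sketch; variable names are ours):

```python
import numpy as np

# Full first iteration of the worked example.
X0 = np.array([[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]])
G1 = np.array([[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]])
alpha_1 = max(1e-3, np.sqrt(np.mean(X0 ** 2))) * 1e-2          # Step 1
V1 = np.outer((G1 ** 2).mean(axis=1), (G1 ** 2).mean(axis=0))  # Steps 2-3
U1 = G1 / np.sqrt(V1)                                          # Step 4.1
U1_hat = U1 / max(1.0, np.sqrt(np.mean(U1 ** 2)))              # Step 4.2 (d = 1)
X1 = X0 - alpha_1 * U1_hat                                     # Step 5
```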
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6970</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6970"/>
		<updated>2024-12-11T21:44:05Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial gradient (​&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Second-moment regularization constant)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning-rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Root Mean Square(RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\sum_{i=1}^n  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. And because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1}[i] = \tfrac{1}{m}\sum_{j=1}^m G^{2}_t[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, but averaging over the rows within each column.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\sum_{i=1}^n G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3} &amp;amp; \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267&amp;amp; 0.1867&amp;amp;0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). (The algorithm in Section 3 additionally divides by the normalizer &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;; this constant rescaling of &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is omitted here because it cancels in the clipping step whenever &amp;lt;math&amp;gt;\text{RMS}(U_t) &amp;gt; d&amp;lt;/math&amp;gt;.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t&amp;lt;/math&amp;gt; is computed by scaling the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; element-wise by the inverse square root of the second moment estimate (&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;math&amp;gt;G_1&amp;lt;/math&amp;gt;&#039;&#039;&#039; and &amp;lt;math&amp;gt;\hat{V}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1,\tfrac{RMS(U_t)}{d})} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1)&amp;gt;d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_1 =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00806 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.692&amp;amp;-0.496&amp;amp;0.887 \\-1.091&amp;amp;0.791&amp;amp;-1.598\\ 1.195&amp;amp;-0.691&amp;amp;0.391\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
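The full first iteration above can be checked in a few lines of plain Python. This is a sketch of the worked example only (not a production optimizer); the step size is the &amp;lt;math&amp;gt;\alpha_1&amp;lt;/math&amp;gt; computed in Step 1, and &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is negligible at this scale, so it is omitted.&lt;br /&gt;

```python
import math

# One Adafactor iteration on the 3x3 example above (beta_hat_21 = 0,
# so the factored moments reduce to row/column means of G**2).
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
G  = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
alpha_1, d = 0.00806, 1.0  # step size from Step 1, clipping threshold

G2 = [[g * g for g in row] for row in G]                       # Step 2
R  = [sum(row) / 3 for row in G2]                              # Step 3.1: row means
C  = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]   # Step 3.2: column means
V  = [[R[i] * C[j] for j in range(3)] for i in range(3)]       # Step 3.3: outer product
U  = [[G[i][j] / math.sqrt(V[i][j]) for j in range(3)] for i in range(3)]  # Step 4.1
rms_u = math.sqrt(sum(u * u for row in U for u in row) / 9)
scale = max(1.0, rms_u / d)                                    # Step 4.2: clipping
X1 = [[X0[i][j] - alpha_1 * U[i][j] / scale for j in range(3)] for i in range(3)]
```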
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6969</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6969"/>
		<updated>2024-12-11T21:23:34Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
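The vector-case algorithm box translates almost line for line into plain Python. The sketch below is illustrative only: the function name and the toy quadratic loss are our own choices, and a practical implementation would operate on framework tensors rather than lists.&lt;br /&gt;

```python
import math

def rms(v):
    # Root mean square of a vector, as defined in the parameters section.
    return math.sqrt(sum(x * x for x in v) / len(v))

def adafactor_vector(x, grad, T=100, eps1=1e-30, eps2=1e-3, d=1.0):
    # One possible rendering of the vector-case algorithm box above.
    x = list(x)
    v_hat = [0.0] * len(x)
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / math.sqrt(t))          # relative step size
        beta = 1.0 - t ** (-0.8)                     # second-moment decay
        alpha = max(eps2, rms(x)) * rho              # adaptive step size
        g = grad(x)
        v_hat = [beta * v + (1.0 - beta) * (gi * gi + eps1)
                 for v, gi in zip(v_hat, g)]
        u = [gi / math.sqrt(vi) for gi, vi in zip(g, v_hat)]
        scale = max(1.0, rms(u) / d)                 # update clipping
        x = [xi - alpha * ui / scale for xi, ui in zip(x, u)]
    return x

# Toy loss f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = adafactor_vector([1.0, -2.0, 3.0], grad=lambda x: list(x))
```

Note that at &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; the decay is zero, so the second moment is simply the current squared gradient, matching the input condition &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;.&lt;br /&gt;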
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
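The proposed schedules are pure functions of the step counter &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt;, so they carry no state; a minimal sketch:&lt;br /&gt;

```python
# Proposed Adafactor schedules as functions of the step counter t.
rho = lambda t: min(1e-2, 1.0 / t ** 0.5)     # relative step size
beta2 = lambda t: 1.0 - t ** (-0.8)           # second-moment decay

# Values for the first few steps, e.g. rho(1) = 0.01 and beta2(1) = 0.
first_steps = [(t, rho(t), beta2(t)) for t in (1, 2, 3)]
```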
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial gradient (​&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant added to the squared gradient)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
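Step 1 can be verified directly from the entries of &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt; (a small check, using the schedule value &amp;lt;math&amp;gt;\rho_1 = 0.01&amp;lt;/math&amp;gt;):&lt;br /&gt;

```python
import math

# RMS of the initial weights and the resulting first-step size alpha_1.
X0 = [0.7, -0.5, 0.9, -1.1, 0.8, -1.6, 1.2, -0.7, 0.4]
rms_x0 = math.sqrt(sum(x * x for x in X0) / len(X0))
alpha_1 = max(1e-3, rms_x0) * 1e-2   # max(eps2, RMS(X_0)) * rho_1
```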
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the squared value of each element in the gradient matrix &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Compute the exponential moving average of squared gradients to capture the variance or scale of gradients.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, for the first iteration &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small, it can be ignored here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{m}\textstyle \sum_{j=1}^m \displaystyle G^{2}_t[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, averaging over rows instead of columns&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\textstyle \sum_{i=1}^n \displaystyle G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} \\ \tfrac{0.04+0.36+0.16}{3}\\\tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267\\ 0.1867\\0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;V_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;​).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1   = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1       =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
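Steps 3.1 through 3.3 can be written out compactly; the sketch below follows the worked example&#039;s convention (outer product of row and column means of the squared gradient):&lt;br /&gt;

```python
# Row/column means of G^2 and their outer product as the factored
# second-moment estimate (beta_hat_21 = 0, so the moving averages
# reduce to the current means on the first step).
G = [[0.3, -0.2, 0.4], [-0.5, 0.6, -0.1], [0.2, -0.4, 0.3]]
G2 = [[g * g for g in row] for row in G]
R = [sum(row) / 3 for row in G2]                              # row moments
C = [sum(G2[i][j] for i in range(3)) / 3 for j in range(3)]   # column moments
V = [[R[i] * C[j] for j in range(3)] for i in range(3)]       # outer product
```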
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1,\tfrac{RMS(U_t)}{d})} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Compute RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_1)&amp;gt;d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_1 =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
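The clipping step can be checked numerically from the &amp;lt;math&amp;gt;U_1&amp;lt;/math&amp;gt; matrix above (rounded entries, threshold &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;):&lt;br /&gt;

```python
import math

# RMS of the normalized update and the clipping scale, applied to U_1.
U1 = [[2.711, -1.489, 4.370], [-3.090, 3.055, -0.747], [1.807, -2.978, 3.278]]
rms_u = math.sqrt(sum(u * u for row in U1 for u in row) / 9)
scale = max(1.0, rms_u / 1.0)          # clip only when RMS exceeds d
U_hat = [[u / scale for u in row] for row in U1]
```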
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result of the first iteration&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}  - 0.00946 \cdot      \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 =   \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885 \\-1.090&amp;amp;0.790&amp;amp;-1.597\\ 1.194&amp;amp;-0.690&amp;amp;0.389\end{bmatrix}       &amp;lt;/math&amp;gt;&lt;br /&gt;
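The weight update can be reproduced end to end by recomputing the step size from &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt; (a sketch; values agree with the worked example to three decimals):&lt;br /&gt;

```python
import math

# Final update: recompute alpha_1 = rho_1 * RMS(X_0), then apply the
# clipped update matrix from Step 4.2.
X0 = [[0.7, -0.5, 0.9], [-1.1, 0.8, -1.6], [1.2, -0.7, 0.4]]
U_hat = [[0.965, -0.530, 1.556], [-1.100, 1.088, -0.266], [0.644, -1.060, 1.167]]
rms_x0 = math.sqrt(sum(x * x for row in X0 for x in row) / 9)
alpha_1 = max(1e-3, rms_x0) * 1e-2
X1 = [[X0[i][j] - alpha_1 * U_hat[i][j] for j in range(3)] for i in range(3)]
```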
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6955</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6955"/>
		<updated>2024-12-11T17:10:11Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial gradient (​&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant added to the squared gradient)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
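For readers who want to verify the arithmetic, Step 1 can be reproduced with a few lines of Python (a sketch using NumPy; the variable names are ours):

```python
import numpy as np

# Initial weights X_0 from the problem setup
X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -1.6],
               [1.2, -0.7, 0.4]])

eps2 = 1e-3                          # minimum learning rate scaling factor
rho_1 = min(1e-2, 1 / np.sqrt(1))    # relative step size at t = 1

rms_X0 = np.sqrt(np.mean(X0 ** 2))   # root mean square of the weights
alpha_1 = max(eps2, rms_X0) * rho_1  # learning rate scaling
```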
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Square the gradient value&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t\in\mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; is negligibly small, it is omitted here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_1&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_t[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\textstyle \sum_{i=1}^m \displaystyle G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3} &amp;amp; \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;V_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). (The full algorithm additionally divides this product by the sum of the row moments; dropping that constant factor only rescales &amp;lt;math&amp;gt;U_t&amp;lt;/math&amp;gt;, and the rescaling is undone by the clipping in Step 4.2.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;V_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;V_t = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;V_t =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
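Step 3 can be checked the same way. At &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; the decay weight is zero and &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is negligible, so the moments reduce to plain row- and column-wise means (a NumPy sketch; names are ours):

```python
import numpy as np

# Gradient G_t from the problem setup
G = np.array([[0.3, -0.2, 0.4],
              [-0.5, 0.6, -0.1],
              [0.2, -0.4, 0.3]])

G2 = G ** 2                # element-wise squared gradient (Step 2)
R1 = G2.mean(axis=1)       # row-wise means   (R_1)
C1 = G2.mean(axis=0)       # column-wise means (C_1)
V1 = np.outer(R1, C1)      # factored second-moment estimate V_1
```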
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{V_t+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;V_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Calculate RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_t) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_t[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_t) &amp;gt; d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_t} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
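Steps 4.1 and 4.2 together can be sketched as follows (NumPy; &amp;lt;math&amp;gt;V_1&amp;lt;/math&amp;gt; is rebuilt from Step 3 so the snippet is self-contained):

```python
import numpy as np

# Gradient G_t from the problem setup
G = np.array([[0.3, -0.2, 0.4],
              [-0.5, 0.6, -0.1],
              [0.2, -0.4, 0.3]])

G2 = G ** 2
V1 = np.outer(G2.mean(axis=1), G2.mean(axis=0))  # factored V_1 (Step 3)

d = 1.0                              # clipping threshold
U1 = G / np.sqrt(V1)                 # normalized gradient (Step 4.1)
rms_U = np.sqrt(np.mean(U1 ** 2))    # RMS of the unclipped update
U1_hat = U1 / max(1.0, rms_U / d)    # clipped update (Step 4.2)
```

After clipping, the RMS of the update is exactly 1 here, so the step size is controlled entirely by the learning rate scaling from Step 1.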
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Apply the parameter update rule&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substituting the learning rate scaling &amp;lt;math&amp;gt;\alpha_1&amp;lt;/math&amp;gt; from Step 1 and the clipped update vector &amp;lt;math&amp;gt;\hat{U}_1&amp;lt;/math&amp;gt; from Step 4 yields the weights for the next iteration.&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6940</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6940"/>
		<updated>2024-12-11T07:00:35Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
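One full pass of the loop above can be sketched as a function (a minimal NumPy illustration of the matrix variant, not a reference implementation; the function name and the toy loss are ours):

```python
import numpy as np

def adafactor_matrix_step(X, grad_fn, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor update for a matrix parameter X of shape (n, m)."""
    rho_t = min(1e-2, 1 / np.sqrt(t))                   # relative step size
    beta2t = 1 - t ** -0.8                              # second-moment decay
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t

    G = grad_fn(X)
    G2 = G ** 2 + eps1
    R = beta2t * R + (1 - beta2t) * G2.sum(axis=1)      # row-wise moments
    C = beta2t * C + (1 - beta2t) * G2.sum(axis=0)      # column-wise moments
    V = np.outer(R, C) / R.sum()                        # factored estimate
    U = G / np.sqrt(V)                                  # normalized gradient
    U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)  # clipping
    return X - alpha_t * U_hat, R, C

# Toy example: one step on f(X) = 0.5 * ||X||^2, whose gradient is X itself
X = np.array([[0.7, -0.5], [-1.1, 0.8]])
R, C = np.zeros(2), np.zeros(2)
X1, R, C = adafactor_matrix_step(X, lambda W: W, R, C, t=1)
```

Note that only the vectors R and C persist between steps, which is the memory saving over storing the full second-moment matrix.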
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
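These schedules are simple functions of the step counter &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; and can be written directly (a small sketch):

```python
def rho(t):
    """Relative step size: min(10^-2, 1/sqrt(t))."""
    return min(1e-2, 1 / t ** 0.5)

def beta2_hat(t):
    """Second-moment decay: 1 - t^(-0.8); equals 0 at t = 1."""
    return 1 - t ** -0.8
```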
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial gradient (&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant added to the squared gradient)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Square the gradient value&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; ​) as an exponential moving average of past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the current row-wise mean of squared gradients ( &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;​ ), with a balance controlled by (&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t\in\mathbb{R}^{m\times n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;. Because &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; is negligibly small, it is omitted here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_1&amp;lt;/math&amp;gt;&#039;&#039;&#039; is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \tfrac{1}{n}\textstyle \sum_{j=1}^n \displaystyle G^{2}_t[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row-wise mean (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} \tfrac{0.09+0.04+0.16}{3} \\ \tfrac{0.25+0.36+0.01}{3}\\\tfrac{0.04+0.16+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.0967\\ 0.2067\\0.0967\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot (\tfrac{1}{m}\textstyle \sum_{i=1}^m \displaystyle G^{2}_t[i,j]+\epsilon_1) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column-wise mean (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} \tfrac{0.09+0.25+0.04}{3} &amp;amp; \tfrac{0.04+0.36+0.16}{3} &amp;amp; \tfrac{0.16+0.01+0.09}{3} \end{bmatrix} = \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;V_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Second Moment Estimate is calculated as the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;). (The full algorithm additionally divides this product by the sum of the row moments; dropping that constant factor only rescales &amp;lt;math&amp;gt;U_t&amp;lt;/math&amp;gt;, and the rescaling is undone by the clipping in Step 4.2.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;V_t = R_t \otimes C_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;V_t = \begin{bmatrix} 0.0967\\0.2067\\0.0967 \end{bmatrix} \otimes    \begin{bmatrix} 0.1267&amp;amp;0.1867&amp;amp;0.0867\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;V_t =  \begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\ 0.0122&amp;amp;0.0180&amp;amp;0.0084\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Update the vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the vector value of &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{V_t+\epsilon_1}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039; and &amp;lt;small&amp;gt;&amp;lt;math&amp;gt;V_t&amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0122&amp;amp;0.0180&amp;amp;0.0084\\ 0.0262&amp;amp;0.0386&amp;amp;0.0179\\0.0122&amp;amp;0.0180&amp;amp;0.0084 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 2.711&amp;amp;-1.489&amp;amp;4.370\\-3.090&amp;amp;3.055&amp;amp;-0.747\\1.807&amp;amp;-2.978&amp;amp;3.278  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;\hat{U_t} = \frac{U_t}{max(1,\tfrac{RMS(U_t)}{d})         } &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Calculate RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;small&amp;gt;&amp;lt;math&amp;gt;RMS(U_t) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_t[i]^2}  \approx 2.808 &amp;lt;/math&amp;gt;&amp;lt;/small&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;RMS(U_t) &amp;gt; d&amp;lt;/math&amp;gt;, scale &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039; by &amp;lt;math&amp;gt;\tfrac{1}{2.808} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U_t} =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Apply the parameter update rule&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \cdot \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substituting the learning rate scaling &amp;lt;math&amp;gt;\alpha_1&amp;lt;/math&amp;gt; from Step 1 and the clipped update vector &amp;lt;math&amp;gt;\hat{U}_1&amp;lt;/math&amp;gt; from Step 4 yields the weights for the next iteration.&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6939</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6939"/>
		<updated>2024-12-11T06:58:10Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
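The vector update above can be sketched in NumPy (a minimal illustration using the hyperparameter schedules listed below; the function name and interface are our own, not from the paper):&lt;br /&gt;

```python
import numpy as np

def adafactor_vector_step(x, grad, v, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor step for a weight vector x (sketch of the algorithm above)."""
    rho = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
    alpha = max(eps2, np.sqrt(np.mean(x ** 2))) * rho  # adaptive step size
    beta2 = 1.0 - t ** -0.8                            # second-moment decay
    v = beta2 * v + (1.0 - beta2) * (grad ** 2 + eps1) # running second moment
    u = grad / np.sqrt(v)                              # normalized gradient
    u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d) # update clipping
    return x - alpha * u_hat, v
```

At &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; the decay is zero, so the second moment is simply the squared gradient and the normalized gradient has unit magnitude entries.&lt;br /&gt;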
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
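The factored matrix update can likewise be sketched in NumPy. This is an illustrative sketch of the algorithm above, not a reference implementation; the row and column statistics are stored as vectors of length n and m rather than a full n-by-m matrix:&lt;br /&gt;

```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor step for a weight matrix X using factored second moments."""
    rho = min(1e-2, 1.0 / np.sqrt(t))
    alpha = max(eps2, np.sqrt(np.mean(X ** 2))) * rho
    beta2 = 1.0 - t ** -0.8
    sq = G ** 2 + eps1
    R = beta2 * R + (1.0 - beta2) * sq.sum(axis=1)     # row statistics, shape (n,)
    C = beta2 * C + (1.0 - beta2) * sq.sum(axis=0)     # column statistics, shape (m,)
    V = np.outer(R, C) / R.sum()                       # factored estimate R C / (1^T R)
    U = G / np.sqrt(V)                                 # normalized gradient
    U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d) # update clipping
    return X - alpha * U_hat, R, C
```

Only the vectors R and C persist between steps; the rank-one estimate V is rebuilt on the fly, which is the source of the memory savings over Adam.&lt;br /&gt;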
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
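The two proposed schedules can be written directly as functions of the step counter (a small illustration; the function names are ours):&lt;br /&gt;

```python
import numpy as np

def rho(t):
    """Relative step size: min(1e-2, 1/sqrt(t)); capped early, decaying later."""
    return min(1e-2, 1.0 / np.sqrt(t))

def beta2_hat(t):
    """Second-moment decay: 1 - t^(-0.8); starts at 0 and approaches 1."""
    return 1.0 - t ** -0.8
```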
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial gradient (​&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Gradient of the loss function with respect to X&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second moment)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning-rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.2: Find the Learning Rate Scaling (&#039;&#039;&#039;&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Learning rate formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2,RMS(X_0))\cdot \rho_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the RMS&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_1 = \max(0.001,0.946)\cdot 0.01=0.00946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 2: Compute &amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;​ (Element-wise Square of Gradient)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Square the gradient value&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.3^2&amp;amp;(-0.2)^2&amp;amp;0.4^2\\ (-0.5)^2&amp;amp;0.6^2&amp;amp;(-0.1)^2\\0.2^2&amp;amp;(-0.4)^2 &amp;amp;0.3^2 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G^{2}_t = \begin{bmatrix} 0.09&amp;amp; 0.04&amp;amp;0.16\\ 0.25&amp;amp;0.36&amp;amp;0.01\\0.04&amp;amp;0.16&amp;amp;0.09\end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 3: Find the moment estimate&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.1: Compute row moments (&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This equation computes the row-wise second moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) as an exponential moving average of the past moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_{t-1}&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and the row sums of the current squared gradient (&amp;lt;math&amp;gt;G^{2}_t&amp;lt;/math&amp;gt;), with the balance controlled by &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;G^{2}_t \in \mathbb{R}^{n\times m} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} \cdot R_{t-1} + (1-\hat{\beta}_{2t})\cdot \left(\textstyle \sum_{j=1}^m \displaystyle G^{2}_t[i,j]+\epsilon_1\right) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;, the first iteration gives &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, and because &amp;lt;math&amp;gt;\epsilon_1 &amp;lt;/math&amp;gt; is negligibly small it is omitted here. The update of &#039;&#039;&#039;&amp;lt;math&amp;gt;R_1&amp;lt;/math&amp;gt;&#039;&#039;&#039; therefore reduces to the row sums:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_{1} = \textstyle \sum_{j=1}^m \displaystyle G^{2}_t[i,j] &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Row sums (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_1&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;R_1 = \begin{bmatrix} 0.09+0.04+0.16 \\ 0.25+0.36+0.01\\0.04+0.16+0.09 \end{bmatrix} = \begin{bmatrix} 0.29\\ 0.62\\0.29\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.2: Compute column moments (&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The process is the same as for the row moments, summing over rows instead of columns:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t}\cdot C_{t-1} + (1-\hat{\beta}_{2t})\cdot \left(\textstyle \sum_{i=1}^n \displaystyle G^{2}_t[i,j]+\epsilon_1\right) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Column sums (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_1&amp;lt;/math&amp;gt;&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;C_1 = \begin{bmatrix} 0.09+0.25+0.04 &amp;amp; 0.04+0.36+0.16 &amp;amp; 0.16+0.01+0.09 \end{bmatrix} = \begin{bmatrix} 0.38&amp;amp; 0.56&amp;amp;0.26\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.3: Second Moment Estimate (&#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The second moment estimate is the outer product of the row moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;) and column moments (&#039;&#039;&#039;&amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt;&#039;&#039;&#039;), normalized by the total sum &amp;lt;math&amp;gt;1_n^T R_t&amp;lt;/math&amp;gt;, as in the algorithm above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1 = \frac{1}{0.29+0.62+0.29}\begin{bmatrix} 0.29\\0.62\\0.29 \end{bmatrix}    \begin{bmatrix} 0.38&amp;amp;0.56&amp;amp;0.26\\ \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{V}_1 =  \begin{bmatrix} 0.0918&amp;amp;0.1353&amp;amp;0.0628\\ 0.1963&amp;amp;0.2893&amp;amp;0.1343\\ 0.0918&amp;amp;0.1353&amp;amp;0.0628\end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 4: Compute the update vector (&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.1: Find the normalized gradient &amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Substitute &#039;&#039;&#039;&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;&#039;&#039;&#039; and &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 =    \frac{\begin{bmatrix}0.3&amp;amp;-0.2&amp;amp;0.4 \\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4&amp;amp;0.3 \end{bmatrix}}{\sqrt{\begin{bmatrix} 0.0918&amp;amp;0.1353&amp;amp;0.0628\\ 0.1963&amp;amp;0.2893&amp;amp;0.1343\\0.0918&amp;amp;0.1353&amp;amp;0.0628 \end{bmatrix}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;U_1 = \begin{bmatrix} 0.990&amp;amp;-0.544&amp;amp;1.596\\-1.128&amp;amp;1.116&amp;amp;-0.273\\0.660&amp;amp;-1.087&amp;amp;1.197  \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4.2: Clipped Update Vector &amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Formula of &#039;&#039;&#039;&amp;lt;math&amp;gt;\hat{U}_t &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max\left(1,\tfrac{\text{RMS}(U_t)}{d}\right)} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Calculate RMS of &#039;&#039;&#039;&amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\text{RMS}(U_1) = \sqrt{\tfrac{1}{9}  \sum_{i=1}^9 U_1[i]^2}  \approx 1.025 &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;\text{RMS}(U_1)&amp;lt;/math&amp;gt; exceeds &amp;lt;math&amp;gt;d=1&amp;lt;/math&amp;gt;, scale &amp;lt;math&amp;gt;U_1 &amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;\tfrac{1}{1.025} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{U}_1 =   \begin{bmatrix} 0.965&amp;amp;-0.530&amp;amp;1.556 \\-1.100&amp;amp;1.088&amp;amp;-0.266\\0.644&amp;amp;-1.060&amp;amp;1.167 \end{bmatrix} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 5: Weight Update (&amp;lt;/big&amp;gt;&#039;&#039;&#039;&amp;lt;math&amp;gt;X_1 &amp;lt;/math&amp;gt;&#039;&#039;&#039;&amp;lt;big&amp;gt;)&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Apply the parameter update from the algorithm:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 = X_0 - \alpha_1 \hat{U}_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;math&amp;gt;\alpha_1 = \max(\epsilon_2, \text{RMS}(X_0))\cdot \rho_1 \approx 0.00946&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_1 \approx \begin{bmatrix} 0.691&amp;amp;-0.495&amp;amp;0.885\\ -1.090&amp;amp;0.790&amp;amp;-1.597\\1.194&amp;amp;-0.690&amp;amp;0.389 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Because the relative step size is small, the weights move only slightly in the direction opposite the clipped, normalized gradient.&lt;br /&gt;
&lt;br /&gt;
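The entire first iteration can be checked with a short NumPy script (an illustrative sketch; variable names are ours):&lt;br /&gt;

```python
import numpy as np

# First-iteration walkthrough: factored second moment from row/column sums.
X0 = np.array([[0.7, -0.5, 0.9],
               [-1.1, 0.8, -1.6],
               [1.2, -0.7, 0.4]])
G = np.array([[0.3, -0.2, 0.4],
              [-0.5, 0.6, -0.1],
              [0.2, -0.4, 0.3]])

eps2, d, t = 1e-3, 1.0, 1
rho = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size, 0.01 at t=1
alpha = max(eps2, np.sqrt(np.mean(X0 ** 2))) * rho # adaptive step size, about 0.00946

sq = G ** 2                                        # eps1 is negligible and omitted
R = sq.sum(axis=1)                                 # row sums
C = sq.sum(axis=0)                                 # column sums
V = np.outer(R, C) / R.sum()                       # factored second-moment estimate
U = G / np.sqrt(V)                                 # normalized gradient
U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d) # clipped update
X1 = X0 - alpha * U_hat                            # first-iteration weights
```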
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6937</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6937"/>
		<updated>2024-12-11T04:26:58Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Numerical Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* Regularization constant 1: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constant 2: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step size: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
Step-by-step instructions for determining the result of the first iteration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Problem setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Initial weights (&#039;&#039;&#039;&amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;​&#039;&#039;&#039;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;X_0 = \begin{bmatrix} 0.7 &amp;amp;-0.5&amp;amp; 0.9\\ -1.1 &amp;amp; 0.8&amp;amp; -1.6\\1.2&amp;amp;-0.7&amp;amp; 0.4 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient (​&amp;lt;math&amp;gt;G_t&amp;lt;/math&amp;gt;):&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \begin{bmatrix} 0.3&amp;amp;-0.2&amp;amp;0.4\\ -0.5&amp;amp;0.6&amp;amp;-0.1\\0.2&amp;amp;-0.4 &amp;amp;0.3 \end{bmatrix}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Hyperparameters setup&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt; (Regularization constant for the second moment)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt; (Minimum learning-rate scaling factor)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt; (Clipping threshold)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt; (Relative step size)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt; (Second moment decay)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;big&amp;gt;Step 1:  Learning Rate Scaling&amp;lt;/big&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Define the relative step size&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\rho_1 = \min(10^{-2}, 1/\sqrt{1})= 10^{-2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.1: Root Mean Square (RMS) calculation for &amp;lt;math&amp;gt;X_0&amp;lt;/math&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
RMS formula&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{n}\textstyle \sum_{i=1}^n\displaystyle  X_0[i]^2}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substitute the initial weights&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\tfrac{1}{9}(0.7^2+(-0.5)^2+0.9^2+(-1.1)^2+0.8^2+(-1.6)^2+1.2^2+(-0.7)^2+0.4^2)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;RMS(X_0) = \sqrt{\frac{8.05}{9}}\approx 0.946&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the Learning Rate Scaling (&amp;lt;math&amp;gt;\alpha_t&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6936</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6936"/>
		<updated>2024-12-11T04:23:58Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimate; accordingly, its value is chosen to be extremely close to zero&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling in low-magnitude scenarios. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, its relatively larger value keeps updates stable for noisy, low-magnitude parameters.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;min(10^-2, ...)&amp;lt;/math&amp;gt; can caps the learning rate at 10^-2, which is a empirical found for upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; This step size promote convergence of the model. This rate ensures a balance between sufficient exploration in early iteration and stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** 1-...: ensures the decay factor remains close to 1&lt;br /&gt;
** &amp;lt;math&amp;gt;t^{-0,8}&amp;lt;/math&amp;gt; the power 0.8 ensures a balance between rapid adaptation in early training and later iterations&lt;br /&gt;
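The two schedules above are easy to inspect numerically. A small sketch, following the formulas directly:

```python
def rho(t):
    # relative step size: 1/sqrt(t), capped at 1e-2
    return min(1e-2, 1.0 / t ** 0.5)

def beta2_hat(t):
    # second-moment decay: 0 at t=1, approaching 1 as training proceeds
    return 1.0 - t ** (-0.8)

for t in (1, 100, 10_000, 1_000_000):
    print(t, rho(t), beta2_hat(t))
```

Note that the cap binds for all of early training (the uncapped value 1/sqrt(t) only drops below 1e-2 after t = 10,000), while the decay factor rises from 0 toward 1 from the first step.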
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
* However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
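The memory saving is easy to quantify. A toy comparison of how many floats each optimizer's second-moment state holds for a single weight matrix (illustrative sizes; a 4096 by 4096 layer is an assumption for the example):

```python
def adam_second_moment_size(n, m):
    # Adam keeps a full n x m second-moment matrix per weight matrix
    return n * m

def adafactor_second_moment_size(n, m):
    # Adafactor keeps only a length-n row vector and a length-m column vector
    return n + m

n, m = 4096, 4096
print(adam_second_moment_size(n, m))       # 16,777,216 floats
print(adafactor_second_moment_size(n, m))  # 8,192 floats
```

For square layers the ratio is m/2, so the savings grow with layer width, which is what makes the factored representation attractive for very large models.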
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6935</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6935"/>
		<updated>2024-12-11T04:23:50Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Clipping */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
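The RMS quantity used throughout can be sketched as a one-liner (a minimal helper, mirroring the definition above):

```python
import numpy as np

def rms(u):
    # root mean square over all entries of u
    u = np.asarray(u, dtype=float)
    return np.sqrt(np.mean(u ** 2))

print(rms([3.0, 4.0]))  # sqrt((9 + 16) / 2) = sqrt(12.5)
```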
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
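The vector case above can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation; `grad_fn` and the default constants are assumptions for the example:

```python
import numpy as np

def adafactor_vector_step(x, v, t, grad_fn,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    """One illustrative Adafactor update for a vector parameter x (n,).

    v is the running second-moment estimate; grad_fn returns the
    gradient of the loss at x.
    """
    rms = lambda a: np.sqrt(np.mean(a ** 2))
    rho_t = min(1e-2, 1.0 / np.sqrt(t))          # relative step size
    alpha_t = max(eps2, rms(x)) * rho_t          # adaptive step size
    beta2t = 1.0 - t ** (-0.8)                   # decay, 0 at t=1
    g = grad_fn(x)
    v = beta2t * v + (1 - beta2t) * (g ** 2 + eps1)  # second-moment estimate
    u = g / np.sqrt(v)                           # normalized gradient
    u_hat = u / max(1.0, rms(u) / d)             # clipping
    return x - alpha_t * u_hat, v
```

Because &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;, the initial value of v is irrelevant at the first step; it can simply be initialized to zeros.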
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; its value should therefore be very close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy or low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
** A threshold of 1 balances stability and learning efficiency: it avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence by balancing sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; form keeps the decay factor close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training with stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
* However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6934</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6934"/>
		<updated>2024-12-11T04:23:40Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* 5.Discussion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; its value should therefore be very close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy or low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
** A threshold of 1 balances stability and learning efficiency: it avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence by balancing sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; form keeps the decay factor close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training with stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
&lt;br /&gt;
=== Why Clipping ===&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
* However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6933</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6933"/>
		<updated>2024-12-11T04:23:32Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates, so the numerical value should be very close to zero&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by bounding the adaptive step size when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, a relatively larger value keeps updates stable under noise and in low-magnitude scenarios.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** The &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; form keeps the decay factor close to 1 as training progresses&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training against stable accumulation in later iterations&lt;br /&gt;
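The two schedules listed above can be sketched directly (a minimal illustration; the function names are ours):

```python
import math

def relative_step(t):
    # rho_t = min(1e-2, 1/sqrt(t)): capped at 1e-2 early, decaying for large t
    return min(1e-2, 1.0 / math.sqrt(t))

def second_moment_decay(t):
    # beta2_t = 1 - t^(-0.8): equals 0 at t = 1, approaches 1 as t grows
    return 1.0 - t ** (-0.8)

for t in (1, 100, 10_000, 1_000_000):
    print(t, relative_step(t), second_moment_decay(t))
```

Note that the &lt;math&amp;gt;1/\sqrt{t}&lt;/math&amp;gt; branch only takes effect once &lt;math&amp;gt;t &gt; 10^4&lt;/math&amp;gt;; before that the cap dominates.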
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
=== Why Clipping ===&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
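The clipping rule can be illustrated with a short sketch (names are ours; a flat list stands in for the update tensor):

```python
import math

def clip_update(U, d=1.0):
    # U_hat = U / max(1, RMS(U) / d): rescales only when RMS(U) exceeds d
    rms = math.sqrt(sum(u * u for u in U) / len(U))
    scale = max(1.0, rms / d)
    return [u / scale for u in U]

small = clip_update([0.1, -0.2, 0.3])     # RMS below d: left unchanged
large = clip_update([10.0, -20.0, 30.0])  # RMS above d: scaled down so RMS = d
```

Unlike hard element-wise clipping, this rescales the whole update uniformly, so its direction is preserved.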
&lt;br /&gt;
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
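As a rough illustration of the savings, consider a hypothetical 4096 x 4096 weight matrix (the numbers below count stored second-moment entries only):

```python
# Adam keeps a full n x m second-moment matrix; Adafactor's factored
# estimate keeps only the row vector R (length n) and column vector C (length m).
n, m = 4096, 4096
adam_entries = n * m        # O(n*m) storage
adafactor_entries = n + m   # O(n+m) storage
ratio = adam_entries / adafactor_entries
print(adam_entries, adafactor_entries, ratio)
```

For this shape the factored estimate stores roughly 2000x fewer second-moment entries.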
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6932</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6932"/>
		<updated>2024-12-11T04:23:15Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Clipping */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
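For a flat list of entries, the RMS used above reduces to the following sketch (function name ours):

```python
import math

def rms(values):
    # Root of the mean of the squared entries
    return math.sqrt(sum(v * v for v in values) / len(values))

print(rms([3.0, 4.0]))  # sqrt((9 + 16) / 2)
```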
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
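A minimal sketch of one iteration of this vector variant in plain Python (illustrative names, not the reference implementation; here the full second-moment vector is kept, one entry per parameter):

```python
import math

def adafactor_vector_step(x, g, v_prev, t, eps1=1e-30, eps2=1e-3, d=1.0):
    n = len(x)
    beta2 = 1.0 - t ** (-0.8)            # equals 0 at t = 1, so V_1 = G_1^2 + eps1
    rho = min(1e-2, 1.0 / math.sqrt(t))  # relative step size
    alpha = max(eps2, math.sqrt(sum(xi * xi for xi in x) / n)) * rho
    v = [beta2 * vp + (1 - beta2) * (gi * gi + eps1) for vp, gi in zip(v_prev, g)]
    u = [gi / math.sqrt(vi) for gi, vi in zip(g, v)]
    rms_u = math.sqrt(sum(ui * ui for ui in u) / n)
    scale = max(1.0, rms_u / d)          # update clipping
    x_new = [xi - alpha * ui / scale for xi, ui in zip(x, u)]
    return x_new, v
```

Because &lt;math&amp;gt;\hat{\beta}_{21} = 0&lt;/math&amp;gt;, the first step needs no bias correction: the second-moment estimate starts from the first squared gradient itself.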
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimates; its value is therefore chosen to be vanishingly small so that it does not distort the gradient statistics&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by bounding the adaptive step size when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, a relatively larger value keeps updates stable under noise and in low-magnitude scenarios.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** The &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; form keeps the decay factor close to 1 as training progresses&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training against stable accumulation in later iterations&lt;br /&gt;
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
=== Why Clipping ===&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6931</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6931"/>
		<updated>2024-12-11T04:22:49Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Problem formulation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimates; its value is therefore chosen to be vanishingly small so that it does not distort the gradient statistics&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by bounding the adaptive step size when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, a relatively larger value keeps updates stable under noise and in low-magnitude scenarios.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; term caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** The &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; form keeps the decay factor close to 1 as training progresses&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training against stable accumulation in later iterations&lt;br /&gt;
&lt;br /&gt;
=== 5. Discussion ===&lt;br /&gt;
=== Why Clipping ===&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
=== Why Adafactor is more memory efficient, compared to Adam ===&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6930</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6930"/>
		<updated>2024-12-11T04:21:45Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
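As a concrete illustration, the vector loop above can be sketched in NumPy (a minimal, hypothetical implementation using the proposed schedules; the gradient oracle f_grad, starting point, and iteration count are placeholders, not values from the paper):&lt;br /&gt;

```python
import numpy as np

def adafactor_vector(f_grad, x0, T=100, d=1.0, eps1=1e-30, eps2=1e-3):
    """Sketch of Adafactor for a weight vector, following the loop above."""
    x = x0.astype(float).copy()
    v = np.zeros_like(x)                      # second moment estimate V_t
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))     # relative step size rho_t
        alpha = max(eps2, np.sqrt(np.mean(x**2))) * rho  # alpha_t from RMS(x_{t-1})
        g = f_grad(x)                          # gradient G_t
        beta2 = 1.0 - t**-0.8                  # decay; beta_{21} = 0 at t = 1
        v = beta2 * v + (1.0 - beta2) * (g**2 + eps1)
        u = g / np.sqrt(v)                     # normalized gradient U_t
        u = u / max(1.0, np.sqrt(np.mean(u**2)) / d)  # clip by RMS(U_t) / d
        x = x - alpha * u
    return x
```

Running this on a simple quadratic (gradient 2x) steadily shrinks the iterate toward the origin, since the clipped, RMS-scaled update behaves like a capped relative step.&lt;br /&gt;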
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
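A one-step sketch of the factored second moment used above (since the decay at the first step is 0, the moving averages reduce to the current squared-gradient statistics; the gradient G here is a toy example):&lt;br /&gt;

```python
import numpy as np

G = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # toy gradient matrix
eps1 = 1e-30

S = G**2 + eps1                   # G_t^2 + eps1 * 1_n 1_m^T
R = S.sum(axis=1)                 # row sums: (...) 1_m
C = S.sum(axis=0)                 # column sums: 1_n^T (...)
V_hat = np.outer(R, C) / R.sum()  # rank-1 estimate R_t C_t / (1_n^T R_t)
U = G / np.sqrt(V_hat)            # normalized gradient
```

The rank-1 estimate preserves the row and column sums of the true squared-gradient matrix, which are exactly the statistics Adafactor tracks.&lt;br /&gt;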
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
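To put rough numbers on this, a back-of-envelope comparison for a single weight matrix (the dimensions and float32 storage are illustrative assumptions, not figures from the paper):&lt;br /&gt;

```python
# Hypothetical transformer-scale layer: an n x m float32 weight matrix.
n, m = 4096, 16384
bytes_per_float = 4

adam_second_moment = n * m * bytes_per_float         # full matrix, O(n*m)
adafactor_second_moment = (n + m) * bytes_per_float  # row + column vectors, O(n+m)

ratio = adam_second_moment / adafactor_second_moment
print(round(ratio))  # ~3277x less second-moment memory for this layer
```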
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; the value is therefore chosen to be extremely close to zero&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value ensures stability under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence by balancing sufficient exploration in early iterations with stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; ensures the decay factor remains close to 1&lt;br /&gt;
** In &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt;, the exponent 0.8 balances rapid adaptation in early training against stability in later iterations&lt;br /&gt;
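The two schedules above are simple enough to evaluate directly; this small sketch shows the capped step size and the decay factor approaching 1:&lt;br /&gt;

```python
def rho(t):
    """Proposed relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / t**0.5)

def beta2(t):
    """Proposed second-moment decay: 1 - t^(-0.8)."""
    return 1.0 - t**-0.8

for t in (1, 10, 100, 1000):
    print(t, rho(t), round(beta2(t), 4))
# The cap on rho binds from the very first step, while beta2 rises from 0 toward 1.
```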
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6929</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6929"/>
		<updated>2024-12-11T04:20:12Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; the value is therefore chosen to be extremely close to zero&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value ensures stability under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence by balancing sufficient exploration in early iterations with stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; ensures the decay factor remains close to 1&lt;br /&gt;
** In &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt;, the exponent 0.8 balances rapid adaptation in early training against stability in later iterations&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6928</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6928"/>
		<updated>2024-12-11T04:19:39Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&#039;&#039;&#039;Row-wise and Column-wise Second Moment Updates&#039;&#039;&#039;&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt;, Adafactor maintains only its row and column statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&#039;&#039;&#039;Factored Representation of the Second Moment&#039;&#039;&#039;&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not require &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; memory, since&lt;br /&gt;
** The operation is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as an &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; the value is therefore chosen to be extremely close to zero&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value ensures stability under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence by balancing sufficient exploration in early iterations with stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - \ldots&amp;lt;/math&amp;gt; ensures the decay factor remains close to 1&lt;br /&gt;
** In &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt;, the exponent 0.8 balances rapid adaptation in early training against stability in later iterations&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6927</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6927"/>
		<updated>2024-12-11T04:18:39Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
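As a concrete illustration, one iteration of the vector algorithm above can be sketched in NumPy. This is a minimal sketch, not a production implementation: grad_fn stands in for the gradient of the loss, and the proposed schedules rho_t = min(10^-2, 1/sqrt(t)) and beta_2t = 1 - t^-0.8 from later in this section are assumed.

```python
import numpy as np

def adafactor_vector_step(x, v_hat, t, grad_fn, d=1.0, eps1=1e-30, eps2=1e-3):
    """One Adafactor update for a vector parameter x; t is 1-based."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                    # relative step size
    beta2t = 1.0 - t ** (-0.8)                             # equals 0 at t = 1
    alpha_t = max(eps2, np.sqrt(np.mean(x ** 2))) * rho_t  # adaptive step size
    g = grad_fn(x)
    v_hat = beta2t * v_hat + (1.0 - beta2t) * (g ** 2 + eps1)  # second moment
    u = g / np.sqrt(v_hat)                                 # normalized gradient
    u_hat = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)     # update clipping
    return x - alpha_t * u_hat, v_hat
```

Starting from v_hat = 0 is consistent with beta_21 = 0, since the first update overwrites the accumulator entirely.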
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
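Similarly, one iteration of the factored matrix algorithm can be sketched in NumPy. Again a minimal sketch under the same assumed hyperparameter schedules; R and C are the row and column accumulators carried between iterations, and grad_fn is a stand-in for the loss gradient.

```python
import numpy as np

def adafactor_matrix_step(X, R, C, t, grad_fn, d=1.0, eps1=1e-30, eps2=1e-3):
    """One Adafactor update for a matrix parameter X of shape (n, m).

    R (length n) and C (length m) are the running row/column second-moment
    accumulators; t is the 1-based iteration index."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
    beta2t = 1.0 - t ** (-0.8)                           # second-moment decay
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t  # adaptive step size

    G = grad_fn(X)
    G2 = G ** 2 + eps1                                   # regularized squared gradient
    R = beta2t * R + (1.0 - beta2t) * G2.sum(axis=1)     # row-wise sums
    C = beta2t * C + (1.0 - beta2t) * G2.sum(axis=0)     # column-wise sums
    # Factored estimate (R_t C_t) / (1_n^T R_t); this sketch materializes it
    # for clarity, while a memory-lean implementation applies it element-wise.
    V_hat = np.outer(R, C) / R.sum()
    U = G / np.sqrt(V_hat)                               # normalized gradient
    U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)   # update clipping
    return X - alpha_t * U_hat, R, C
```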
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise sums, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. &lt;br /&gt;
*However, this does not reintroduce &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; storage, since&lt;br /&gt;
** The division is performed element-wise during the update, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as a persistent &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
** Only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored between iterations, rather than the full second-moment matrix&lt;br /&gt;
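The storage saving is easy to quantify. A quick count for a hypothetical 1024 x 4096 weight matrix (sizes chosen purely for illustration):

```python
# Hypothetical layer sizes, for illustration only.
n, m = 1024, 4096
adam_second_moment = n * m        # Adam stores one second-moment entry per weight
adafactor_accumulators = n + m    # Adafactor stores row sums plus column sums
print(adam_second_moment, adafactor_accumulators)
print(adam_second_moment // adafactor_accumulators)  # roughly 819x fewer entries
```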
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimates; its value is therefore chosen to be negligibly small&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling in low-magnitude scenarios. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value maintains stability under noisy, low-magnitude gradients.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt;: this decaying step size promotes convergence of the model, balancing sufficient exploration in early iterations against stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** 1-...: ensures the decay factor remains close to 1&lt;br /&gt;
** &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt;: the exponent 0.8 balances rapid adaptation in early training against stability in later iterations&lt;br /&gt;
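The two proposed schedules can be checked numerically; the small sketch below tabulates rho_t and beta_2t at a few arbitrary iteration counts.

```python
import math

for t in [1, 100, 10**4, 10**6]:
    rho_t = min(1e-2, 1.0 / math.sqrt(t))  # relative step size, capped at 1e-2
    beta2t = 1.0 - t ** (-0.8)             # second-moment decay, approaches 1
    print(f"t={t}: rho_t={rho_t:.4g}, beta2t={beta2t:.4f}")
```

Note that the cap on rho_t binds until t = 10^4, and beta_2t starts at exactly 0 (consistent with the requirement that beta_21 = 0) before climbing toward 1.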
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6926</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6926"/>
		<updated>2024-12-11T04:17:38Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise sums, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. However, this does not reintroduce &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; storage, since &lt;br /&gt;
** the division is performed element-wise, so &amp;lt;math&amp;gt;\hat{V_t}&amp;lt;/math&amp;gt; is never materialized as a persistent &amp;lt;math&amp;gt;n\times m&amp;lt;/math&amp;gt; matrix&lt;br /&gt;
&lt;br /&gt;
and only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored between iterations, instead of the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimates; its value is therefore chosen to be negligibly small&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling in low-magnitude scenarios. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value maintains stability under noisy, low-magnitude gradients.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt;: this decaying step size promotes convergence of the model, balancing sufficient exploration in early iterations against stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** 1-...: ensures the decay factor remains close to 1&lt;br /&gt;
** &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt;: the exponent 0.8 balances rapid adaptation in early training against stability in later iterations&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6925</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6925"/>
		<updated>2024-12-11T04:15:23Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming very large, which would destabilize training&lt;br /&gt;
* Clipping mitigates the effects of very large gradients, preventing numerical instability&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise sums, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
This reconstructs the second-moment estimate from the outer product &amp;lt;math&amp;gt;R_t C_t&amp;lt;/math&amp;gt;. However, this does not reintroduce &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; storage, since the division is performed element-wise and only &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored between iterations, rather than the full second-moment matrix&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimates; its value is therefore chosen to be negligibly small&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling in low-magnitude scenarios. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value maintains stability under noisy, low-magnitude gradients.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \ldots)&amp;lt;/math&amp;gt; caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt;: this decaying step size promotes convergence of the model, balancing sufficient exploration in early iterations against stability in later iterations&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** 1-...: ensures the decay factor remains close to 1&lt;br /&gt;
** &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt;: the exponent 0.8 balances rapid adaptation in early training against stability in later iterations&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6923</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6923"/>
		<updated>2024-12-11T04:08:49Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
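The vector-case loop above can be sketched in Python as follows. This is a minimal illustration assuming NumPy and the proposed hyperparameters from Section 4; the function name, the &#039;&#039;grad_fn&#039;&#039; callback, and the default arguments are names chosen here for illustration, not part of the original specification.&lt;br /&gt;

```python
import numpy as np

def adafactor_vector(grad_fn, x0, T=100, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of the vector-case Adafactor loop (illustrative names)."""
    x = x0.astype(float).copy()
    v_hat = np.zeros_like(x)
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))                  # relative step size
        alpha_t = max(eps2, np.sqrt(np.mean(x**2))) * rho_t  # RMS-scaled step
        g = grad_fn(x)                                       # gradient G_t
        beta2_t = 1.0 - t**-0.8                              # decay (0 at t = 1)
        v_hat = beta2_t * v_hat + (1 - beta2_t) * (g**2 + eps1)
        u = g / np.sqrt(v_hat)                               # normalized gradient
        u_hat = u / max(1.0, np.sqrt(np.mean(u**2)) / d)     # update clipping
        x = x - alpha_t * u_hat
    return x
```

For example, minimizing &amp;lt;math&amp;gt;f(x) = \tfrac{1}{2}\|x\|^2&amp;lt;/math&amp;gt; (gradient is &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; itself) steadily shrinks the iterate toward the origin.&lt;br /&gt;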
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
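The memory saving can be made concrete with a small NumPy calculation. The matrix dimensions and placeholder statistics below are arbitrary illustrative values, not figures from the paper.&lt;br /&gt;

```python
import numpy as np

# For an n x m weight matrix, Adam stores a full n*m second-moment table,
# while Adafactor stores only an n-vector of row sums and an m-vector of
# column sums (illustrative sizes chosen here).
n, m = 4096, 1024
adam_entries = n * m           # O(n*m) storage
adafactor_entries = n + m      # O(n+m) factored storage

# The full estimate is reconstructed on the fly as a rank-1 product:
rng = np.random.default_rng(0)
r = rng.random(n) + 1e-6       # row statistics R_t (placeholder values)
c = rng.random(m) + 1e-6       # column statistics C_t (placeholder values)
v_hat = np.outer(r, c) / r.sum()   # V_t = R_t C_t / (1_n^T R_t), shape (n, m)

assert v_hat.shape == (n, m)
print(adam_entries // adafactor_entries)  # storage ratio: 819
```

For these dimensions the factored representation stores roughly 800 times fewer entries, at the cost of approximating &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; by a rank-1 matrix.&lt;br /&gt;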
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; its value is therefore chosen to be extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value preserves stability under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; operation caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; form ensures the decay factor remains close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training with stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6922</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6922"/>
		<updated>2024-12-11T04:07:24Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; its value is therefore chosen to be extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value preserves stability under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; operation caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; form ensures the decay factor remains close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training with stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6921</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6921"/>
		<updated>2024-12-11T04:06:49Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents the update step from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of very large gradients, preventing numerical instability.&lt;br /&gt;
Therefore, implementing clipping helps ensure stability and efficient training without requiring per-parameter scaling like Adam.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Instead of storing the full &amp;lt;math&amp;gt;G_t^2&amp;lt;/math&amp;gt; matrix, Adafactor maintains only its row-wise and column-wise statistics, which reduces the memory requirement from &amp;lt;math&amp;gt;O(n\times m)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the calculation of second-moment estimates; its value is therefore chosen to be extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value preserves stability under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; operation caps the learning rate at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; form ensures the decay factor remains close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training with stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6920</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6920"/>
		<updated>2024-12-11T04:04:58Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
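The vector update loop above can be sketched in Python. This is a minimal illustration using the proposed default schedules; all function and variable names are our own, not from the paper:

```python
import numpy as np

def adafactor_vector(grad_fn, x0, T, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of the Adafactor update loop for a weight vector."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                      # running second-moment estimate V_t
    for t in range(1, T + 1):
        rho = min(1e-2, 1.0 / np.sqrt(t))     # relative step size rho_t
        beta2 = 1.0 - t ** (-0.8)             # decay beta2_t (equals 0 at t = 1)
        alpha = max(eps2, np.sqrt(np.mean(x ** 2))) * rho   # adaptive step size
        g = grad_fn(x)                        # gradient G_t
        v = beta2 * v + (1.0 - beta2) * (g ** 2 + eps1)
        u = g / np.sqrt(v)                    # normalized gradient U_t
        u = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)      # update clipping
        x = x - alpha * u
    return x
```

On a simple quadratic objective, for example `adafactor_vector(lambda x: 2 * x, [1.0, -1.0], 200)`, the iterate shrinks steadily toward the origin.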
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
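A single step of the factored (matrix) variant can be sketched as follows; the accumulators `r` and `c` play the roles of &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; above, and all names are illustrative rather than from the paper:

```python
import numpy as np

def adafactor_matrix_step(x, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One sketched Adafactor step for an n-by-m weight matrix.

    r (shape (n,)) and c (shape (m,)) are the running row and column
    second-moment accumulators R_t and C_t; initialize both to zeros.
    """
    rho = min(1e-2, 1.0 / np.sqrt(t))                 # relative step size
    beta2 = 1.0 - t ** (-0.8)                         # decay, 0 at t = 1
    alpha = max(eps2, np.sqrt(np.mean(x ** 2))) * rho
    sq = g ** 2 + eps1                                # regularized squared gradient
    r = beta2 * r + (1.0 - beta2) * sq.sum(axis=1)    # row sums -> R_t
    c = beta2 * c + (1.0 - beta2) * sq.sum(axis=0)    # column sums -> C_t
    v = np.outer(r, c) / r.sum()                      # rank-1 reconstruction of V_t
    u = g / np.sqrt(v)                                # normalized gradient
    u = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)    # update clipping
    return x - alpha * u, r, c
```

Because only `r` and `c` are carried between steps, the optimizer state for an n-by-m matrix is n + m numbers rather than the n*m that a full second-moment matrix requires.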
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs update clipping to maintain numerical stability, which is especially important because it is designed for very large models and works with relative, unscaled learning rates.&lt;br /&gt;
* Clipping prevents any single update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of occasional very large gradients, preventing numerical instability.&lt;br /&gt;
Implementing clipping therefore helps ensure stable and efficient training without requiring per-parameter scaling as Adam does.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
Adam stores a full second-moment estimate with one entry per parameter, i.e. &amp;lt;math&amp;gt;nm&amp;lt;/math&amp;gt; values for an &amp;lt;math&amp;gt;n \times m&amp;lt;/math&amp;gt; weight matrix. Adafactor instead keeps only the row and column statistics of the squared gradients:&lt;br /&gt;
*&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
*&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Only the &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; entries of &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and the &amp;lt;math&amp;gt;m&amp;lt;/math&amp;gt; entries of &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, and &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is reconstructed from them on the fly, reducing second-moment memory from &amp;lt;math&amp;gt;O(nm)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
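Back-of-the-envelope arithmetic makes the saving concrete (the layer size below is an arbitrary example, not from the paper):

```python
# Second-moment state for one hypothetical n-by-m weight matrix.
n, m = 10_000, 4_096

adam_second_moment = n * m        # Adam keeps one v entry per parameter
adafactor_second_moment = n + m   # Adafactor keeps row sums + column sums

print(adam_second_moment // adafactor_second_moment)  # ~2905x fewer values
```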
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimate; its value should therefore be extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence: this decaying rate balances sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \cdots&amp;lt;/math&amp;gt; form ensures the decay factor remains close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training against stability in later iterations.&lt;br /&gt;
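The two schedules above are easy to tabulate; a tiny helper (names ours) shows how they behave over training:

```python
def rel_step(t):
    """Relative step size rho_t = min(1e-2, 1/sqrt(t))."""
    return min(1e-2, t ** -0.5)

def decay(t):
    """Second-moment decay beta2_t = 1 - t**-0.8."""
    return 1.0 - t ** -0.8

# Early iterations: decay(1) == 0.0, so the first squared gradient is used
# directly, and the 1e-2 step-size cap is already active at t = 1.
# Late iterations: rel_step keeps shrinking (1e-3 at t = 1e6) while decay
# approaches 1, giving a long-memory second-moment average.
```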
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6919</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6919"/>
		<updated>2024-12-11T04:04:43Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs update clipping to maintain numerical stability, which is especially important because it is designed for very large models and works with relative, unscaled learning rates.&lt;br /&gt;
* Clipping prevents any single update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of occasional very large gradients, preventing numerical instability.&lt;br /&gt;
Implementing clipping therefore helps ensure stable and efficient training without requiring per-parameter scaling as Adam does.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
Adam stores a full second-moment estimate with one entry per parameter, i.e. &amp;lt;math&amp;gt;nm&amp;lt;/math&amp;gt; values for an &amp;lt;math&amp;gt;n \times m&amp;lt;/math&amp;gt; weight matrix. Adafactor instead keeps only the row and column statistics of the squared gradients:&lt;br /&gt;
&amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
Only the &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; entries of &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and the &amp;lt;math&amp;gt;m&amp;lt;/math&amp;gt; entries of &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; are stored, and &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is reconstructed from them, reducing second-moment memory from &amp;lt;math&amp;gt;O(nm)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimate; its value should therefore be extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence: this decaying rate balances sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \cdots&amp;lt;/math&amp;gt; form ensures the decay factor remains close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training against stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6918</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6918"/>
		<updated>2024-12-11T04:03:23Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Adafactor is more memory efficient, compared to Adam */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs update clipping to maintain numerical stability, which is especially important because it is designed for very large models and works with relative, unscaled learning rates.&lt;br /&gt;
* Clipping prevents any single update step from becoming very large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of occasional very large gradients, preventing numerical instability.&lt;br /&gt;
Implementing clipping therefore helps ensure stable and efficient training without requiring per-parameter scaling as Adam does.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is more memory efficient, compared to Adam ====&lt;br /&gt;
* Instead of storing the full &amp;lt;math&amp;gt;n \times m&amp;lt;/math&amp;gt; second-moment matrix as Adam does, Adafactor keeps only row and column statistics of the squared gradients.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Ensures numerical stability by preventing division by zero in the second-moment estimate; its value should therefore be extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Helps stabilize parameter updates by controlling the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
* A threshold of 1 balances stability and learning efficiency. It avoids excessive suppression of large gradients, which could hinder learning, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** &amp;lt;math&amp;gt;\frac{1}{\sqrt{t}}&amp;lt;/math&amp;gt; promotes convergence: this decaying rate balances sufficient exploration in early iterations with stability in later iterations.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;1 - \cdots&amp;lt;/math&amp;gt; form ensures the decay factor remains close to 1.&lt;br /&gt;
** The exponent in &amp;lt;math&amp;gt;t^{-0.8}&amp;lt;/math&amp;gt; balances rapid adaptation in early training against stability in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6915</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6915"/>
		<updated>2024-12-11T03:59:52Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Clipping */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in R^n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
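&lt;br /&gt;
As a sketch of the factorization, the matrix algorithm can be reduced to NumPy as follows; this is illustrative only (the gradient &#039;&#039;g&#039;&#039; is passed in directly), and shows how the full second moment is rebuilt from row and column statistics.&lt;br /&gt;

```python
import numpy as np

def rms(z):
    # root mean square of the entries of z
    return np.sqrt(np.mean(z ** 2))

def adafactor_matrix_step(x, r, c, rho_t, beta2t, g,
                          eps1=1e-30, eps2=1e-3, d=1.0):
    # one iteration of the matrix algorithm; g is the gradient G_t
    alpha = max(eps2, rms(x)) * rho_t
    sq = g ** 2 + eps1
    # row and column statistics: O(n + m) memory rather than O(n * m)
    r = beta2t * r + (1.0 - beta2t) * sq.sum(axis=1)    # R_t
    c = beta2t * c + (1.0 - beta2t) * sq.sum(axis=0)    # C_t
    # rank-one reconstruction: V_t = R_t C_t / (1_n^T R_t)
    v = np.outer(r, c) / r.sum()
    u = g / np.sqrt(v)
    u_hat = u / max(1.0, rms(u) / d)                    # clipped update
    return x - alpha * u_hat, r, c
```

&lt;br /&gt;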
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents individual update steps from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of occasional very large gradients, preventing numerical instability.&lt;br /&gt;
Clipping therefore supports stable, efficient training without the per-parameter update scaling that Adam requires.&lt;br /&gt;
&lt;br /&gt;
==== Why Adafactor is More Memory-Efficient than Adam ====&lt;br /&gt;
Adam keeps a full second-moment estimate with one entry per parameter, i.e. &amp;lt;math&amp;gt;O(nm)&amp;lt;/math&amp;gt; memory for an &amp;lt;math&amp;gt;n \times m&amp;lt;/math&amp;gt; weight matrix. Adafactor instead maintains only the row statistics &amp;lt;math&amp;gt;R_t&amp;lt;/math&amp;gt; and column statistics &amp;lt;math&amp;gt;C_t&amp;lt;/math&amp;gt; of the squared gradients and reconstructs &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; as their rank-one outer product, reducing the second-moment storage from &amp;lt;math&amp;gt;O(nm)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(n + m)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Ensures numerical stability by preventing division by zero when computing second-moment estimates; accordingly, its value is kept extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Helps stabilize parameter updates by bounding the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
** A threshold of 1 balances stability and learning efficiency: it avoids suppressing large gradients so strongly that learning stalls, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; term caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later ones.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; keeps the decay factor close to 1 as &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; grows.&lt;br /&gt;
** The exponent 0.8 balances rapid adaptation early in training with stable averaging in later iterations.&lt;br /&gt;
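&lt;br /&gt;
To make the two time-dependent schedules concrete, they can be tabulated with a short helper; the helper names &#039;&#039;rho&#039;&#039; and &#039;&#039;beta2_hat&#039;&#039; are illustrative, and the values follow directly from the formulas above.&lt;br /&gt;

```python
def rho(t):
    # relative step size: min(1e-2, 1/sqrt(t))
    return min(1e-2, t ** -0.5)

def beta2_hat(t):
    # second-moment decay: 1 - t**(-0.8)
    return 1.0 - t ** -0.8

# the 1e-2 cap dominates until t = 10**4, after which 1/sqrt(t) takes over
for t in (1, 100, 10 ** 4, 10 ** 6):
    print(t, rho(t), round(beta2_hat(t), 4))
```

&lt;br /&gt;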
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6913</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6913"/>
		<updated>2024-12-11T03:54:44Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Clipping */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents individual update steps from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of occasional very large gradients, preventing numerical instability.&lt;br /&gt;
Clipping therefore supports stable, efficient training without the per-parameter update scaling that Adam requires.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Ensures numerical stability by preventing division by zero when computing second-moment estimates; accordingly, its value is kept extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Helps stabilize parameter updates by bounding the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
** A threshold of 1 balances stability and learning efficiency: it avoids suppressing large gradients so strongly that learning stalls, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; term caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later ones.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; keeps the decay factor close to 1 as &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; grows.&lt;br /&gt;
** The exponent 0.8 balances rapid adaptation early in training with stable averaging in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6912</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6912"/>
		<updated>2024-12-11T03:52:38Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Adafactor for Weighted Matrices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Why Clipping ====&lt;br /&gt;
Adafactor employs clipping to maintain numerical stability, especially since it is designed for use with very large models and often works with unscaled learning rates. &lt;br /&gt;
* Clipping prevents individual update steps from becoming excessively large, which would destabilize training.&lt;br /&gt;
* Clipping mitigates the effect of occasional very large gradients, preventing numerical instability.&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Ensures numerical stability by preventing division by zero when computing second-moment estimates; accordingly, its value is kept extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Helps stabilize parameter updates by bounding the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
** A threshold of 1 balances stability and learning efficiency: it avoids suppressing large gradients so strongly that learning stalls, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; term caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later ones.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; keeps the decay factor close to 1 as &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; grows.&lt;br /&gt;
** The exponent 0.8 balances rapid adaptation early in training with stable averaging in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
	<entry>
		<id>https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6911</id>
		<title>Adafactor</title>
		<link rel="alternate" type="text/html" href="https://optimization.cbe.cornell.edu/index.php?title=Adafactor&amp;diff=6911"/>
		<updated>2024-12-11T03:52:22Z</updated>

		<summary type="html">&lt;p&gt;Fall2024 Wiki Team6: /* Why Clipping */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)&lt;br /&gt;
&lt;br /&gt;
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
== Problem formulation ==&lt;br /&gt;
=== 1. Objective ===&lt;br /&gt;
Minimize the loss function &amp;lt;math&amp;gt;f(x)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;x \in \mathbb{R}^n&amp;lt;/math&amp;gt; is the weight vector to be optimized.&lt;br /&gt;
&lt;br /&gt;
=== 2. Parameters ===&lt;br /&gt;
*&#039;&#039;&#039; Gradient:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;math&amp;gt;G_t = \nabla f(x_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Second moment estimate:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; \hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Where:&#039;&#039;&#039;&lt;br /&gt;
** &amp;lt;math&amp;gt;\hat{V}_t&amp;lt;/math&amp;gt; is the running average of the squared gradient.&lt;br /&gt;
**&amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; is the corrected decay parameter.&lt;br /&gt;
**&amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Step size:&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Where&#039;&#039;&#039;:&lt;br /&gt;
** &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; is the relative step size.&lt;br /&gt;
** &amp;lt;math&amp;gt;\epsilon_2&amp;lt;/math&amp;gt; is a regularization constant.&lt;br /&gt;
** &amp;lt;math&amp;gt;\text{RMS}&amp;lt;/math&amp;gt; is the root mean square, defined as:&lt;br /&gt;
*** &amp;lt;math&amp;gt;u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}&amp;lt;/math&amp;gt;&lt;br /&gt;
*** &amp;lt;math&amp;gt;\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}&amp;lt;/math&amp;gt;&lt;br /&gt;
=== 3. Algorithms ===&lt;br /&gt;
==== Adafactor for Weighted Vectors ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^n&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
==== Adafactor for Weighted Matrices ====&lt;br /&gt;
&#039;&#039;&#039;Inputs:&#039;&#039;&#039;&lt;br /&gt;
* Initial point: &amp;lt;math&amp;gt;X_0 \in \mathbb{R}^{n \times m}&amp;lt;/math&amp;gt;&lt;br /&gt;
* Relative step sizes: &amp;lt;math&amp;gt;\rho_t&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;&lt;br /&gt;
* Second moment decay: &amp;lt;math&amp;gt;\hat{\beta}_{2t}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, with &amp;lt;math&amp;gt;\hat{\beta}_{21} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
* Regularization constants: &amp;lt;math&amp;gt;\epsilon_1, \epsilon_2&amp;lt;/math&amp;gt;&lt;br /&gt;
* Clipping threshold: &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Algorithm:&#039;&#039;&#039;&lt;br /&gt;
* For &amp;lt;math&amp;gt;t = 1&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;:&lt;br /&gt;
** Compute adaptive step size: &amp;lt;math&amp;gt;\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute gradient: &amp;lt;math&amp;gt;G_t = \nabla f_t(X_{t-1})&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update row-wise second moment: &amp;lt;math&amp;gt;R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update column-wise second moment: &amp;lt;math&amp;gt;C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update overall second moment estimate: &amp;lt;math&amp;gt;\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Compute normalized gradient: &amp;lt;math&amp;gt;U_t = \frac{G_t}{\sqrt{\hat{V}_t}}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Apply clipping: &amp;lt;math&amp;gt;\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Update parameter: &amp;lt;math&amp;gt;X_t = X_{t-1} - \alpha_t \hat{U}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
* End for&lt;br /&gt;
&lt;br /&gt;
=== 4. Proposed Hyperparameters for Adafactor ===&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 1&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_1 = 10^{-30}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Ensures numerical stability by preventing division by zero when computing second-moment estimates; accordingly, its value is kept extremely close to zero.&lt;br /&gt;
* &#039;&#039;&#039;Regularization constant 2&#039;&#039;&#039;: &amp;lt;math&amp;gt;\epsilon_2 = 10^{-3}&amp;lt;/math&amp;gt;&lt;br /&gt;
** Helps stabilize parameter updates by bounding the effect of second-moment scaling when parameter magnitudes are small. Compared to &amp;lt;math&amp;gt;\epsilon_1&amp;lt;/math&amp;gt;, this relatively larger value keeps updates stable under noisy, low-magnitude conditions.&lt;br /&gt;
* &#039;&#039;&#039;Clipping threshold&#039;&#039;&#039;: &amp;lt;math&amp;gt;d = 1&amp;lt;/math&amp;gt;&lt;br /&gt;
** A threshold of 1 balances stability and learning efficiency: it avoids suppressing large gradients so strongly that learning stalls, while still protecting against extreme updates that could destabilize the model.&lt;br /&gt;
* &#039;&#039;&#039;Relative step size&#039;&#039;&#039;: &amp;lt;math&amp;gt;\rho_t = \min(10^{-2}, 1/\sqrt{t})&amp;lt;/math&amp;gt;&lt;br /&gt;
** The &amp;lt;math&amp;gt;\min(10^{-2}, \cdot)&amp;lt;/math&amp;gt; term caps the relative step size at &amp;lt;math&amp;gt;10^{-2}&amp;lt;/math&amp;gt;, an empirically determined upper bound.&lt;br /&gt;
** The &amp;lt;math&amp;gt;1/\sqrt{t}&amp;lt;/math&amp;gt; decay promotes convergence, balancing sufficient exploration in early iterations against stability in later ones.&lt;br /&gt;
* &#039;&#039;&#039;Second moment decay&#039;&#039;&#039;: &amp;lt;math&amp;gt;\hat{\beta}_{2t} = 1 - t^{-0.8}&amp;lt;/math&amp;gt;&lt;br /&gt;
** The form &amp;lt;math&amp;gt;1 - t^{-0.8}&amp;lt;/math&amp;gt; keeps the decay factor close to 1 as &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; grows.&lt;br /&gt;
** The exponent 0.8 balances rapid adaptation early in training with stable averaging in later iterations.&lt;br /&gt;
&lt;br /&gt;
== Numerical Examples ==&lt;br /&gt;
== Applications ==&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
== Reference ==&lt;/div&gt;</summary>
		<author><name>Fall2024 Wiki Team6</name></author>
	</entry>
</feed>