Adafactor

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Author: Aolei Cao (ac3237), Ziyang Li (zl986), Junjia Liang (jl4439) (ChemE 6800 Fall 2024)
| style="width:61%; color:#000;" |
{| style="width:100%; border:none; background:none;"
| style="text-align:center; white-space:nowrap; color:#000;" |
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">Welcome to the Cornell University Computational Optimization Open Textbook</div>


This electronic textbook is a student-contributed open-source text covering a variety of topics on process optimization.<br />
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
'''If you have any comments or suggestions on this open textbook, please contact [https://www.engineering.cornell.edu/faculty-directory/fengqi-you  Professor Fengqi You].'''
|}
|}


{| id="mp-upper" style="width: 100%; margin:6px 0 0 0; background:none; border-spacing: 0px;"
== Introduction ==
| class="MainPageBG" style="width:50%; border:1px solid #cef2e0; background:#f5fffa; vertical-align:top; color:#000;" |
== Problem formulation ==
{| id="mp-left" style="width:100%; vertical-align:top; background:#f5fffa;"
=== 1. Objective ===
! style="padding:2px;" | <h2 id="mp-tfa-h2" style="margin:3px; background:#cef2e0; font-size:120%; font-weight:bold; border:1px solid #a3bfb1; text-align:left; color:#000; padding:0.2em 0.4em;">Linear Programming (LP)</h2>
Minimize the loss function <math>f(x)</math>, where <math>x \in \mathbb{R}^n</math> is the weight vector to be optimized (for example, the training loss of a neural network as a function of its weights).
=== 2. Parameters ===
* '''Gradient:'''
*: <math>G_t = \nabla f(x_{t-1})</math>
* '''Second moment estimate:'''
*: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
*: where:
** <math>\hat{V}_t</math> is the running average of the squared gradient.
** <math>\hat{\beta}_{2t}</math> is the corrected decay parameter.
** <math>\epsilon_1</math> is a regularization constant.
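
To make the update concrete, here is a minimal NumPy sketch of one exponential-moving-average step for the second moment; the function name and arguments are illustrative, not part of the original formulation.

<syntaxhighlight lang="python">
import numpy as np

def update_second_moment(v_hat, g, beta2_t, eps1=1e-30):
    """One EMA step: V_t = beta2_t * V_{t-1} + (1 - beta2_t) * (G_t^2 + eps1)."""
    return beta2_t * v_hat + (1.0 - beta2_t) * (np.square(g) + eps1)
</syntaxhighlight>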
* '''Step size:'''
*: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(x_{t-1})) \rho_t</math>
*: where:
** <math>\rho_t</math> is the relative step size.
** <math>\epsilon_2</math> is a regularization constant.
** <math>\text{RMS}</math> is the root mean square. For an update <math>U_t</math> with components <math>u_{xt} = \frac{-g_{xt}}{\sqrt{\hat{v}_{xt}}}</math>, it is defined as:
*: <math>\text{RMS}(U_t) = \text{RMS}_{x \in X}(u_{xt}) = \sqrt{\text{Mean}_{x \in X}\left(\frac{(g_{xt})^2}{\hat{v}_{xt}}\right)}</math>
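
As a quick illustration of these definitions, the following NumPy snippet computes the RMS of a weight vector and the resulting adaptive step size; the variable names are our own, and the constants are the proposed values from Section 4 below.

<syntaxhighlight lang="python">
import numpy as np

def rms(a):
    """Root mean square over all entries of an array."""
    return np.sqrt(np.mean(np.square(a)))

# Adaptive step size: alpha_t = max(eps2, RMS(x_{t-1})) * rho_t
x = np.array([0.5, -1.2, 3.0])   # current weights x_{t-1}
eps2, rho_t = 1e-3, 1e-2         # proposed eps2 and relative step size
alpha_t = max(eps2, rms(x)) * rho_t
</syntaxhighlight>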
=== 3. Problem Formulation ===
==== Adafactor for Weighted Vectors ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^n</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1, \epsilon_2</math>
* Clipping threshold: <math>d</math>
 
'''Algorithm:'''
# For <math>t = 1</math> to <math>T</math>:
## Compute the adaptive step size:
##: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
## Compute the gradient:
##: <math>G_t = \nabla f_t(X_{t-1})</math>
## Update the second moment estimate:
##: <math>\hat{V}_t = \hat{\beta}_{2t} \hat{V}_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n)</math>
## Compute the normalized gradient:
##: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
## Apply clipping:
##: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
## Update the parameter:
##: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
# End for
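
The loop above translates almost line for line into NumPy. The following is a minimal sketch, assuming the proposed hyperparameter schedules from Section 4 below; the function name <code>adafactor_vector</code> and its signature are our own.

<syntaxhighlight lang="python">
import numpy as np

def rms(a):
    return np.sqrt(np.mean(np.square(a)))

def adafactor_vector(grad_fn, x0, T=100, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight vector x in R^n."""
    x = np.array(x0, dtype=float)
    v_hat = np.zeros_like(x)                      # second moment estimate V_t
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size
        beta2_t = 1.0 - t ** (-0.8)               # decay; beta2_1 = 0 as required
        alpha_t = max(eps2, rms(x)) * rho_t       # adaptive step size
        g = grad_fn(x)                            # gradient G_t
        v_hat = beta2_t * v_hat + (1.0 - beta2_t) * (np.square(g) + eps1)
        u = g / np.sqrt(v_hat)                    # normalized gradient U_t
        u_hat = u / max(1.0, rms(u) / d)          # update clipping
        x = x - alpha_t * u_hat                   # parameter update
    return x

# Usage: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = adafactor_vector(lambda x: 2.0 * x, x0=[5.0, -3.0, 1.0], T=500)
</syntaxhighlight>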
 
==== Adafactor for Weighted Matrices ====
'''Inputs:'''
* Initial point: <math>X_0 \in \mathbb{R}^{n \times m}</math>
* Relative step sizes: <math>\rho_t</math> for <math>t = 1</math> to <math>T</math>
* Second moment decay: <math>\hat{\beta}_{2t}</math> for <math>t = 1</math> to <math>T</math>, with <math>\hat{\beta}_{21} = 0</math>
* Regularization constants: <math>\epsilon_1, \epsilon_2</math>
* Clipping threshold: <math>d</math>
 
'''Algorithm:'''
# For <math>t = 1</math> to <math>T</math>:
## Compute the adaptive step size:
##: <math>\alpha_t = \max(\epsilon_2, \text{RMS}(X_{t-1})) \rho_t</math>
## Compute the gradient:
##: <math>G_t = \nabla f_t(X_{t-1})</math>
## Update the row-wise second moment:
##: <math>R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1 1_n 1_m^T) 1_m</math>
## Update the column-wise second moment:
##: <math>C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t}) 1_n^T (G_t^2 + \epsilon_1 1_n 1_m^T)</math>
## Update the overall second moment estimate:
##: <math>\hat{V}_t = \frac{R_t C_t}{1_n^T R_t}</math>
## Compute the normalized gradient:
##: <math>U_t = \frac{G_t}{\sqrt{\hat{V}_t}}</math>
## Apply clipping:
##: <math>\hat{U}_t = \frac{U_t}{\max(1, \text{RMS}(U_t) / d)}</math>
## Update the parameter:
##: <math>X_t = X_{t-1} - \alpha_t \hat{U}_t</math>
# End for
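
A matching NumPy sketch of the factored version follows. Only the row and column accumulators <math>R_t</math> and <math>C_t</math> are stored between iterations, which is where the memory saving over a full <math>n \times m</math> second-moment matrix comes from; the function name and defaults are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def rms(a):
    return np.sqrt(np.mean(np.square(a)))

def adafactor_matrix(grad_fn, X0, T=100, eps1=1e-30, eps2=1e-3, d=1.0):
    """Sketch of Adafactor for a weight matrix X in R^{n x m}."""
    X = np.array(X0, dtype=float)
    n, m = X.shape
    R = np.zeros(n)                               # row accumulator R_t
    C = np.zeros(m)                               # column accumulator C_t
    for t in range(1, T + 1):
        rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size
        beta2_t = 1.0 - t ** (-0.8)               # decay; beta2_1 = 0 as required
        alpha_t = max(eps2, rms(X)) * rho_t       # adaptive step size
        G = grad_fn(X)                            # gradient G_t
        sq = np.square(G) + eps1                  # G_t^2 + eps1 * 1_n 1_m^T
        R = beta2_t * R + (1.0 - beta2_t) * sq.sum(axis=1)  # row sums
        C = beta2_t * C + (1.0 - beta2_t) * sq.sum(axis=0)  # column sums
        V_hat = np.outer(R, C) / R.sum()          # rank-1 estimate R_t C_t / (1_n^T R_t)
        U = G / np.sqrt(V_hat)                    # normalized gradient U_t
        U_hat = U / max(1.0, rms(U) / d)          # update clipping
        X = X - alpha_t * U_hat                   # parameter update
    return X
</syntaxhighlight>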
 
=== 4. Proposed Hyperparameters for Adafactor ===
* Regularization constant 1: <math>\epsilon_1 = 10^{-30}</math>
* Regularization constant 2: <math>\epsilon_2 = 10^{-3}</math>
* Clipping threshold: <math>d = 1</math>
* Relative step size: <math>\rho_t = \min(10^{-2}, 1/\sqrt{t})</math>
* Second moment decay: <math>\hat{\beta}_{2t} = 1 - t^{-0.8}</math>
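
Written as functions, the two proposed schedules look like this (function names are our own):

<syntaxhighlight lang="python">
import numpy as np

def rho(t):
    """Proposed relative step size: min(10^-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / np.sqrt(t))

def beta2(t):
    """Proposed second moment decay: 1 - t^(-0.8); note beta2(1) == 0."""
    return 1.0 - t ** (-0.8)
</syntaxhighlight>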
 
== Numerical Examples ==
== Applications ==
== Conclusion ==
== Reference ==
