|
|
Line 4: |
Line 4: |
|
| |
|
| == Introduction == | | == Introduction == |
| == Problem formulation ==
| |
| <h1>1. Objective</h1>
| |
| <p>Minimize the loss function <b>f(x)</b>, where <b>x ∈ ℝⁿ</b> and <b>x</b> is the weight vector to be optimized.</p>
| |
|
| |
| <h1>2. Parameters</h1>
| |
| <ul>
| |
| <li><b>Gradient:</b>
| |
| <div><i>G<sub>t</sub> = ∇f(x<sub>t-1</sub>)</i></div>
| |
| </li>
| |
| <li><b>Second moment estimate:</b>
| |
| <div><i>Ĥ<sub>Vt</sub> = Ĥ<sub>β2t</sub> Ĥ<sub>Vt-1</sub> + (1 - Ĥ<sub>β2t</sub>)(G<sub>t</sub>² + ε₁ 1ₙ)</i></div>
| |
| <ul>
| |
| <li><i>Ĥ<sub>Vt</sub></i> is the running average of the squared gradient.</li>
| |
| <li><i>Ĥ<sub>β2t</sub></i> is the corrected decay parameter.</li>
| |
| <li><i>ε₁</i> is a regularization constant.</li>
| |
| </ul>
| |
| </li>
| |
| <li><b>Step size:</b>
| |
| <div><i>α<sub>t</sub> = max(ε₂, RMS(x<sub>t-1</sub>)) ρ<sub>t</sub></i></div>
| |
| <ul>
| |
| <li><i>ρ<sub>t</sub></i> is the relative step size.</li>
| |
| <li><i>ε₂</i> is a regularization constant.</li>
| |
| <li><i>RMS</i> is the root mean square, defined as:
| |
| <div><i>u<sub>xt</sub> = -g<sub>xt</sub> / √Ĥ<sub>vxt</sub></i></div>
| |
| <div><i>RMS(U<sub>t</sub>) = RMS<sub>x ∈ X</sub>(u<sub>xt</sub>) = √Mean<sub>x ∈ X</sub>(g<sub>xt</sub>² / Ĥ<sub>vxt</sub>)</i></div>
| |
| </li>
| |
| </ul>
| |
| </li>
| |
| </ul>
| |
|
| |
| <h1>3. Problem Formulation</h1>
| |
| <h2>Adafactor for Weighted Vectors</h2>
| |
| <h3>Inputs:</h3>
| |
| <ul>
| |
| <li>Initial point: <i>X₀ ∈ ℝⁿ</i></li>
| |
| <li>Relative step sizes: <i>ρ<sub>t</sub></i> for <i>t = 1</i> to <i>T</i></li>
| |
| <li>Second moment decay: <i>Ĥ<sub>β2t</sub></i> for <i>t = 1</i> to <i>T</i>, with <i>Ĥ<sub>β21</sub> = 0</i></li>
| |
| <li>Regularization constants: <i>ε₁, ε₂</i></li>
| |
| <li>Clipping threshold: <i>d</i></li>
| |
| </ul>
| |
| <h3>Algorithm:</h3>
| |
| <ul>
| |
| <li>For <i>t = 1</i> to <i>T</i>:
| |
| <ul>
| |
| <li>Compute adaptive step size:
| |
| <div><i>α<sub>t</sub> = max(ε₂, RMS(X<sub>t-1</sub>)) ρ<sub>t</sub></i></div>
| |
| </li>
| |
| <li>Compute gradient:
| |
| <div><i>G<sub>t</sub> = ∇f<sub>t</sub>(X<sub>t-1</sub>)</i></div>
| |
| </li>
| |
| <li>Update second moment estimate:
| |
| <div><i>Ĥ<sub>Vt</sub> = Ĥ<sub>β2t</sub> Ĥ<sub>Vt-1</sub> + (1 - Ĥ<sub>β2t</sub>)(G<sub>t</sub>² + ε₁ 1ₙ)</i></div>
| |
| </li>
| |
| <li>Compute normalized gradient:
| |
| <div><i>U<sub>t</sub> = G<sub>t</sub> / √Ĥ<sub>Vt</sub></i></div>
| |
| </li>
| |
| <li>Apply clipping:
| |
| <div><i>Ĥ<sub>U<sub>t</sub></i> = U<sub>t</sub> / max(1, RMS(U<sub>t</sub>) / d)</i></div>
| |
| </li>
| |
| <li>Update parameter:
| |
| <div><i>X<sub>t</sub> = X<sub>t-1</sub> - α<sub>t</sub> Ĥ<sub>U<sub>t</sub></i></div>
| |
| </li>
| |
| </ul>
| |
| </li>
| |
| </ul>
| |
|
| |
| <h2>Adafactor for Weighted Matrices</h2>
| |
| <h3>Inputs:</h3>
| |
| <ul>
| |
| <li>Initial point: <i>X₀ ∈ ℝⁿ × ℝ<sup>m</sup></i></li>
| |
| <li>Relative step sizes: <i>ρ<sub>t</sub></i> for <i>t = 1</i> to <i>T</i></li>
| |
| <li>Second moment decay: <i>Ĥ<sub>β2t</sub></i> for <i>t = 1</i> to <i>T</i>, with <i>Ĥ<sub>β21</sub> = 0</i></li>
| |
| <li>Regularization constants: <i>ε₁, ε₂</i></li>
| |
| <li>Clipping threshold: <i>d</i></li>
| |
| </ul>
| |
| <h3>Algorithm:</h3>
| |
| <ul>
| |
| <li>For <i>t = 1</i> to <i>T</i>:
| |
| <ul>
| |
| <li>Compute adaptive step size:
| |
| <div><i>α<sub>t</sub> = max(ε₂, RMS(X<sub>t-1</sub>)) ρ<sub>t</sub></i></div>
| |
| </li>
| |
| <li>Compute gradient:
| |
| <div><i>G<sub>t</sub> = ∇f<sub>t</sub>(X<sub>t-1</sub>)</i></div>
| |
| </li>
| |
| <li>Update row-wise second moment:
| |
| <div><i>R<sub>t</sub> = Ĥ<sub>β2t</sub> R<sub>t-1</sub> + (1 - Ĥ<sub>β2t</sub>)(G<sub>t</sub>² + ε₁ 1ₙ 1ₘᵀ) 1ₘ</i></div>
| |
| </li>
| |
| <li>Update column-wise second moment:
| |
| <div><i>C<sub>t</sub> = Ĥ<sub>β2t</sub> C<sub>t-1</sub> + (1 - Ĥ<sub>β2t</sub>) 1ₙᵀ (G<sub>t</sub>² + ε₁ 1ₙ 1ₘᵀ)</i></div>
| |
| </li>
| |
| <li>Update overall second moment estimate:
| |
| <div><i>Ĥ<sub>Vt</sub> = R<sub>t</sub> C<sub>t</sub> / (1ₙᵀ R<sub>t</sub>)</i></div>
| |
| </li>
| |
| <li>Compute normalized gradient:
| |
| <div><i>U<sub>t</sub> = G<sub>t</sub> / √Ĥ<sub>Vt</sub></i></div>
| |
| </li>
| |
| <li>Apply clipping:
| |
| <div><i>Ĥ<sub>U<sub>t</sub></i> = U<sub>t</sub> / max(1, RMS(U<sub>t</sub>) / d)</i></div>
| |
| </li>
| |
| <li>Update parameter:
| |
| <div><i>X<sub>t</sub> = X<sub>t-1</sub> - α<sub>t</sub> Ĥ<sub>U<sub>t</sub></i></div>
| |
| </li>
| |
| </ul>
| |
| </li>
| |
| </ul>
| |
|
| |
| <h1>4. Proposed Hyperparameters for Adafactor</h1>
| |
| <ul>
| |
| <li>Regularization constant 1: <i>ε₁ = 10⁻³⁰</i></li>
| |
| <li>Regularization constant 2: <i>ε₂ = 10⁻³</i></li>
| |
| <li>Clipping threshold: <i>d = 1</i></li>
| |
| <li>Relative step size: <i>ρ<sub>t</sub> = min(10⁻², 1/√t)</i></li>
| |
| <li>Second moment decay: <i>Ĥ<sub>β2t</sub> = 1 - t⁻⁰.⁸</i></li>
| |
| </ul>
| |
| </body>
| |
| </html>
| |
|
| |
| == Numerical Examples == | | == Numerical Examples == |
| == Applications == | | == Applications == |
| == Conclusion == | | == Conclusion == |
| == Reference == | | == Reference == |