AdaGrad: Difference between revisions

Revision as of 18:29, 26 November 2021

Author: Daniel Villarraga (SYSEN 6800 Fall 2021)

Introduction

AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms belonging to that family are similar to second-order stochastic gradient descend with an approximation for the Hessian of the optimized function. AdaGrad's name comes from Adaptative Gradient. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the problem; particularly, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.

AdaGrad was introduced by Duchi et al.^[1] in a highly cited paper published in the Journal of machine learning research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks) and it influenced the development of the Adam algorithm^[2].

Theory

The objective of AdaGrad is to minimize the expected value of a stochastic objective function, with respect to a set of parameters, given a sequence of realizations of the function. As with other sub-gradient-based methods, it achieves so by updating the parameters in the opposite direction of the sub-gradients. While standard sub-gradient methods use update rules with step-sizes that ignore the information from the past observations, AdaGrad adapts the learning rate for each parameter individually using the sequence of gradient estimates.

Definitions

$f(x)$ : Stochastic objective function with parameters $x$ .

$f_{t}(x)$ : Realization of stochastic objective at time step $t$ . For simplicity $f_{t}$ .

$g_{t}(x)$ : The gradient of $f_{t}(x)$ with respect to $x$ , formally $\nabla _{x}f_{t}(x)$ . For simplicity, $g_{t}$ .

$x_{t}$ : Parameters at time step $t$ .

$G_{t}$ : The outer product of all previous subgradients, given by ${\textstyle \sum _{\tau =1}^{t}g_{\tau }g_{\tau }^{\top }}$

Standard Sub-gradient Update

Standard sub-gradient algorithms update parameters $x$ according to the following rule:

x_{t+1}=x_{t}-\eta g_{t}

where

\eta

denotes the step-size often refered as learning rate. Expanding each term on the previous equation, the vector of parameters is updated as follows:

{\begin{bmatrix}x_{t+1}^{(2)}\\x_{t+1}^{(2)}\\\vdots \\x_{t+1}^{(m)}\end{bmatrix}}={\begin{bmatrix}x_{t}^{(2)}\\x_{t}^{(2)}\\\vdots \\x_{t}^{(m)}\end{bmatrix}}-\eta {\begin{bmatrix}g_{t}^{(2)}\\g_{t}^{(2)}\\\vdots \\g_{t}^{(m)}\end{bmatrix}}

AdaGrad Update

The general AdaGrad update rule is given by:

x_{t+1}=x_{t}-\eta G_{t}^{-1/2}g_{t}

where

G_{t}^{-1/2}

is the inverse of the square root of

G_{t}

. A simplified version of the update rule takes the diagonal elements of

G_{t}

instead of the whole matrix:

x_{t+1}=x_{t}-\eta {\text{diag}}(G_{t})^{-1/2}g_{t}

which can be computed in linear time. In practice, a small quantity

\epsilon

is added to each diagonal element in

G_{t}

to avoid singularity problems, the resulting update rule is given by:

x_{t+1}=x_{t}-\eta {\text{diag}}(\epsilon I+G_{t})^{-1/2}g_{t}

where

I

denotes the identity matrix. An expanded form of the previous update is presented below,

{\begin{bmatrix}x_{t+1}^{(2)}\\x_{t+1}^{(2)}\\\vdots \\x_{t+1}^{(m)}\end{bmatrix}}={\begin{bmatrix}x_{t}^{(2)}\\x_{t}^{(2)}\\\vdots \\x_{t}^{(m)}\end{bmatrix}}-{\begin{bmatrix}\eta {\frac {1}{\sqrt {\epsilon +G_{t}^{(1,1)}}}}\\\eta {\frac {1}{\sqrt {\epsilon +G_{t}^{(2,2)}}}}\\\vdots \\\eta {\frac {1}{\sqrt {\epsilon +G_{t}^{(m,m)}}}}\end{bmatrix}}\odot {\begin{bmatrix}g_{t}^{(2)}\\g_{t}^{(2)}\\\vdots \\g_{t}^{(m)}\end{bmatrix}}

where the operator

\odot

denotes the Hadamard product between matrices of the same dimension, and

G_{t}^{(j,j)}

is the

j

element in the

G_{t}

diagonal. From the last expression, it is clear that the update rule for AdaGrad adapts the step-size for each parameter accoding to

\eta {\frac {1}{\sqrt {\epsilon +G_{t}^{j,j}}}}

, in contrast with standard sub-gradient methods that have fixed step-size

\eta

for every parameter.

Algorithm

Regret Bounds

Empirical Performance

Numerical Example

Applications

Summary and Discussion

References

↑ Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
↑ Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[1] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).

[2] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[1]

[2]

@@ Line 2: / Line 2: @@
 == Introduction ==
-AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms belonging to that family are similar to second-order stochastic gradient descend with an approximation for the Hessian of the optimized function. AdaGrad's name comes from '''Ada'''ptative '''Grad'''ient. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the function; additionally, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.
+AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms belonging to that family are similar to second-order stochastic gradient descend with an approximation for the Hessian of the optimized function. AdaGrad's name comes from '''Ada'''ptative '''Grad'''ient. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the problem; particularly, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.
 AdaGrad was introduced by Duchi et al.<ref>Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. ''Journal of machine learning research'', ''12''(7).</ref> in a highly cited paper published in the Journal of machine learning research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks) and it influenced the development of the [[Adam|Adam algorithm]]<ref>Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ''arXiv preprint arXiv:1412.6980''.</ref>.
@@ Line 27: / Line 27: @@
 Standard sub-gradient algorithms update parameters <math>x</math> according to the following rule:
-<math display="block">x_{t+1} = x_t - \eta g_t</math>where <math>\eta</math> denotes the step-size often refered as learning rate. Expanding each term on the previous equation,
+<math display="block">x_{t+1} = x_t - \eta g_t</math>where <math>\eta</math> denotes the step-size often refered as learning rate. Expanding each term on the previous equation, the vector of parameters is updated as follows:
 <math display="block">\begin{bmatrix} x_{t+1}^{(2)} \\ x_{t+1}^{(2)} \\ \vdots \\ x_{t+1}^{(m)} \end{bmatrix} = \begin{bmatrix} x_{t}^{(2)} \\ x_{t}^{(2)} \\ \vdots \\ x_{t}^{(m)} \end{bmatrix} - \eta \begin{bmatrix} g_{t}^{(2)} \\ g_{t}^{(2)} \\ \vdots \\ g_{t}^{(m)} \end{bmatrix} </math>
@@ Line 38: / Line 38: @@
 <math display="block">x_{t+1} = x_t - \eta \text{diag}(G_t)^{-1/2} g_t</math>which can be computed in linear time. In practice, a small quantity <math>\epsilon</math> is added to each diagonal element in <math>G_t</math> to avoid singularity problems, the resulting update rule is given by:
-<math display="block">x_{t+1} = x_t - \eta \text{diag}(\epsilon I + G_t)^{-1/2} g_t</math>where <math>I </math> denotes the identity matrix.
+<math display="block">x_{t+1} = x_t - \eta \text{diag}(\epsilon I + G_t)^{-1/2} g_t</math>where <math>I </math> denotes the identity matrix. An expanded form of the previous update is presented below,
+<math display="block">\begin{bmatrix} x_{t+1}^{(2)} \\ x_{t+1}^{(2)} \\ \vdots \\ x_{t+1}^{(m)} \end{bmatrix} = \begin{bmatrix} x_{t}^{(2)} \\ x_{t}^{(2)} \\ \vdots \\ x_{t}^{(m)} \end{bmatrix} - \begin{bmatrix} \eta \frac{1}{\sqrt{\epsilon + G_t^{(1,1)}}} \\ \eta \frac{1}{\sqrt{\epsilon + G_t^{(2,2)}}} \\ \vdots \\ \eta \frac{1}{\sqrt{\epsilon + G_t^{(m,m)}}} \end{bmatrix} \odot \begin{bmatrix} g_{t}^{(2)} \\ g_{t}^{(2)} \\ \vdots \\ g_{t}^{(m)} \end{bmatrix} </math>where the operator <math>\odot</math> denotes the [[wikipedia:Hadamard_product_(matrices)|Hadamard product]] between matrices of the same dimension, and <math>G_t^{(j,j)}</math> is the <math>j</math> element in the <math>G_t</math> diagonal. From the last expression, it is clear that the update rule for AdaGrad adapts the step-size for each parameter accoding to <math>\eta \frac{1}{\sqrt{\epsilon + G_t^{j,j}}}</math>, in contrast with standard sub-gradient methods that have fixed step-size <math>\eta</math> for every parameter.
 === Algorithm ===

AdaGrad: Difference between revisions

Revision as of 18:29, 26 November 2021

Contents

Introduction

Theory

Definitions

Standard Sub-gradient Update

AdaGrad Update

Algorithm

Regret Bounds

Empirical Performance

Numerical Example

Applications

Summary and Discussion

References

Navigation menu