AdaGrad: Difference between revisions
Line 34:
<math display="block">x_{t+1} = x_t - \eta G_t^{-1/2} g_t</math>where <math>G_t^{-1/2}</math> is the inverse of the square root of <math>G_t</math>. A simplified version of the update rule takes only the diagonal elements of <math>G_t</math> instead of the whole matrix:
<math display="block">x_{t+1} = x_t - \eta \, \text{diag}(G_t)^{-1/2} g_t</math>which can be computed in linear time. In practice, a small quantity <math>\epsilon</math> is added to each diagonal element to avoid singularity; the resulting update rule is:
<math display="block">x_{t+1} = x_t - \eta \, \text{diag}(\epsilon I + G_t)^{-1/2} g_t</math>where <math>I</math> denotes the identity matrix.
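The diagonal update with the <math>\epsilon</math> term can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `adagrad_diag_step`, the default values of `eta` and `eps`, and the quadratic test problem are all assumptions made for the example.

```python
import numpy as np

def adagrad_diag_step(x, g, sum_sq, eta=0.1, eps=1e-8):
    """One diagonal AdaGrad step (illustrative sketch).

    sum_sq is the running sum of squared gradients, i.e. the diagonal of G_t.
    """
    sum_sq = sum_sq + g * g                   # update diag(G_t)
    x = x - eta * g / np.sqrt(eps + sum_sq)   # diag(eps*I + G_t)^{-1/2} g_t
    return x, sum_sq

# Hypothetical test problem: f(x) = 0.5 * (100 * x[0]**2 + x[1]**2),
# whose gradient is scales * x.
scales = np.array([100.0, 1.0])
x = np.array([1.0, 1.0])
sum_sq = np.zeros_like(x)
for _ in range(100):
    g = scales * x
    x, sum_sq = adagrad_diag_step(x, g, sum_sq, eta=0.5)
print(x)
```

Only element-wise operations are involved, so each step costs time linear in the number of parameters. In this example the two coordinates follow nearly the same trajectory despite the 100-fold difference in curvature, because each gradient coordinate is normalized by its own history.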
=== Algorithm ===
Revision as of 17:49, 26 November 2021
Author: Daniel Villarraga (SYSEN 6800 Fall 2021)
Introduction
AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms in this family resemble second-order stochastic gradient descent, with an approximation of the Hessian of the optimized function. The name AdaGrad comes from Adaptive Gradient. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the function; additionally, it tends to assign higher learning rates to infrequent features, so that parameter updates rely less on frequency and more on relevance.
AdaGrad was introduced by Duchi et al.[1] in a highly cited paper published in the Journal of Machine Learning Research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks), and it influenced the development of the Adam algorithm[2].
Theory
The objective of AdaGrad is to minimize the expected value of a stochastic objective function with respect to a set of parameters, given a sequence of realizations of the function. As with other sub-gradient-based methods, it does so by updating the parameters in the direction opposite to the sub-gradients. While standard sub-gradient methods use update rules with step sizes that ignore the information from past observations, AdaGrad adapts the learning rate for each parameter individually using the sequence of gradient estimates.
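To make the per-parameter adaptation concrete, the sketch below computes the effective learning rate of each coordinate under the diagonal form of AdaGrad, where the step size of a coordinate is divided by the square root of its accumulated squared gradients. The gradient stream, the values of `eta` and `eps`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

eta = 0.1    # base step size (assumed value)
eps = 1e-8   # small constant to avoid division by zero (assumed value)

# Hypothetical gradient stream: feature 0 is active at every step,
# feature 1 only at every 10th step (gradient magnitude 1 when active).
sum_sq = np.zeros(2)
for t in range(100):
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    sum_sq += g * g   # accumulate squared gradients per coordinate

effective_lr = eta / np.sqrt(eps + sum_sq)
print(effective_lr)
```

After 100 steps the accumulated squared gradients are 100 and 10, so the infrequent feature keeps an effective learning rate of about 0.032 while the frequent one has dropped to about 0.01.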
Definitions
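<math>f(x)</math>: Stochastic objective function with parameters <math>x</math>.
<math>f_t(x)</math>: Realization of the stochastic objective at time step <math>t</math>. For simplicity, <math>f_t</math>.
<math>g_t(x)</math>: The gradient of <math>f_t(x)</math> with respect to <math>x</math>, formally <math>\nabla_x f_t(x)</math>. For simplicity, <math>g_t</math>.
<math>x_t</math>: Parameters at time step <math>t</math>.
<math>G_t</math>: Sum of the outer products of all previous sub-gradients, given by <math>\textstyle\sum_{\tau=1}^{t} g_\tau g_\tau^\top</math>.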
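As a small worked example of the last definition, the following sketch accumulates the sum of outer products of the sub-gradients and checks that its diagonal equals the per-coordinate sum of squared gradients. The three gradient vectors are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical sub-gradients g_1, g_2, g_3 for a 2-parameter problem.
grads = [np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([-2.0, 0.0])]

G = np.zeros((2, 2))
for g in grads:
    G += np.outer(g, g)   # add the outer product g_tau g_tau^T

# The diagonal of G_t is the per-coordinate sum of squared gradients.
sum_sq = np.sum(np.square(grads), axis=0)
print(G)
print(np.diag(G), sum_sq)
```

The full matrix needs storage quadratic in the number of parameters, whereas its diagonal can be maintained as a single vector.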
Standard Sub-gradient Update
Standard sub-gradient algorithms update parameters according to the following rule:
<math display="block">x_{t+1} = x_t - \eta g_t</math>where <math>\eta</math> is the step size.
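A minimal sketch of a fixed-step sub-gradient update, in which every coordinate is scaled by the same step size. The function name `subgradient_step`, the objective, and the value of `eta` are assumptions made for this example.

```python
import numpy as np

def subgradient_step(x, g, eta):
    """Return x_{t+1} = x_t - eta * g_t; one step size shared by all coordinates."""
    return x - eta * g

# Hypothetical objective f(x) = |x[0]| + |x[1]|, whose sub-gradient is sign(x).
x = np.array([1.0, -0.5])
eta = 0.1
for _ in range(3):
    x = subgradient_step(x, np.sign(x), eta)
print(x)
```

Here the step size does not depend on the gradient history, which is the behavior AdaGrad changes.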
AdaGrad Update
The general update for AdaGrad is given by:
<math display="block">x_{t+1} = x_t - \eta G_t^{-1/2} g_t</math>
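The full-matrix form of the update can be sketched with an eigendecomposition to obtain the inverse square root of the accumulated outer-product matrix. The helper names `inv_sqrt_psd` and `adagrad_full_step`, and the `eps` term added to keep the matrix invertible, are assumptions for this illustration rather than part of the rule as stated.

```python
import numpy as np

def inv_sqrt_psd(G, eps=1e-8):
    """Inverse square root of the symmetric matrix eps*I + G (illustrative helper)."""
    w, V = np.linalg.eigh(G + eps * np.eye(G.shape[0]))
    w = np.maximum(w, eps)            # guard against round-off below eps
    return (V / np.sqrt(w)) @ V.T     # V diag(w^{-1/2}) V^T

def adagrad_full_step(x, g, G, eta=0.1, eps=1e-8):
    """One full-matrix AdaGrad step: accumulate g g^T, then precondition g."""
    G = G + np.outer(g, g)
    x = x - eta * inv_sqrt_psd(G, eps) @ g
    return x, G

# Hypothetical first step from x = (1, 1) with sub-gradient g = (3, 4).
x, G = adagrad_full_step(np.array([1.0, 1.0]), np.array([3.0, 4.0]), np.zeros((2, 2)))
print(x)
```

The eigendecomposition costs cubic time in the number of parameters at every step, which is why the full-matrix form is rarely practical for large models.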