AdaGrad: Difference between revisions
Line 66: | Line 66: | ||
== Numerical Example == | == Numerical Example == | ||
[[File:AdaGrad trayectory.png|left|thumb|473x473px|This figure shows a trayectory of parameter updates for the example presented in this section using AdaGrad. ]] | [[File:AdaGrad trayectory.png|left|thumb|473x473px|This figure shows a trayectory of parameter updates for the example presented in this section using AdaGrad. ]]To illustrate how the parameter updates work in AdaGrad let's go through a numerical example using linear regression with the sum of squared errors. The dataset consists of random generated obsevations <math>x,y </math> that follows the relationship: | ||
<math display="block">y = 20 \cdot x + \epsilon </math> | |||
{| class="wikitable mw-collapsible" | |||
!<math>x </math> | |||
!<math>y </math> | |||
|- | |||
|0.39 | |||
|9.83 | |||
|- | |||
|0.10 | |||
|2.27 | |||
|- | |||
|0.30 | |||
|5.10 | |||
|} | |||
== Applications == | == Applications == |
Revision as of 13:51, 28 November 2021
Author: Daniel Villarraga (SYSEN 6800 Fall 2021)
Introduction
AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms belonging to that family are similar to second-order stochastic gradient descend with an approximation for the Hessian of the optimized function. AdaGrad's name comes from Adaptative Gradient. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the problem; particularly, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.
AdaGrad was introduced by Duchi et al.[1] in a highly cited paper published in the Journal of machine learning research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks) and it influenced the development of the Adam algorithm[2].
Theory
The objective of AdaGrad is to minimize the expected value of a stochastic objective function, with respect to a set of parameters, given a sequence of realizations of the function. As with other sub-gradient-based methods, it achieves so by updating the parameters in the opposite direction of the sub-gradients. While standard sub-gradient methods use update rules with step-sizes that ignore the information from the past observations, AdaGrad adapts the learning rate for each parameter individually using the sequence of gradient estimates.
Definitions
: Stochastic objective function with parameters .
: Realization of stochastic objective at time step . For simplicity .
: The gradient of with respect to , formally . For simplicity, .
: Parameters at time step .
: The outer product of all previous subgradients, given by
Standard Sub-gradient Update
Standard sub-gradient algorithms update parameters according to the following rule:
AdaGrad Update
The general AdaGrad update rule is given by:
Adaptative Learning Rate Effect
An estimate for the uncentered second moment of the objective function's gradient is given by the following expression:
which is similar to the definition of matrix , used in AdaGrad's update rule. Noting that, AdaGrad adapts the learning rate for each parameter proportionally to the inverse of the gradient's variance for every parameter. This leads to the main advantages of AdaGrad:
- Parameters associated with low frequency features tend to have larger learning rates than parameters associated with high frequency features.
- Step-sizes in directions with high gradient variance are lower than the step-sizes in directions with low gradient variance. Geometrically, the step-sizes tend to decrease proportionally to the curvature of the stochastic objective function.
which favor the convergence rate of the algorithm.
Algorithm
The general version of the AdaGrad algorithm is presented in the pseudocode below. The update step within the for loop can be modified with the version that uses the diagonal of .
Regret Bound
The regret is defined as:
where is the set of optimal parameters. AdaGrad has a regret bound of order , which leads to the convergence rate of order , and the convergence guarantee (). The detailed proof and assumptions for this bound can be found in the original journal paper[1].
Numerical Example
To illustrate how the parameter updates work in AdaGrad let's go through a numerical example using linear regression with the sum of squared errors. The dataset consists of random generated obsevations that follows the relationship:
0.39 | 9.83 |
0.10 | 2.27 |
0.30 | 5.10 |
Applications
The AdaGrad algorithm