AdaGrad: Difference between revisions

Revision as of 15:39, 26 November 2021

Author: Daniel Villarraga (SYSEN 6800 Fall 2021)

Introduction

AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms in this family are similar to second-order stochastic gradient descent with an approximation of the Hessian of the optimized function. The name AdaGrad comes from Adaptive Gradient. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the function; additionally, it tends to assign higher learning rates to infrequent features, so that parameter updates rely less on feature frequency and more on feature relevance.
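The per-feature adaptation described above can be illustrated with a minimal sketch of the commonly used diagonal form of the update, in which each coordinate's step is scaled by the inverse square root of its accumulated squared gradients. The function name `adagrad_step`, the base learning rate `lr`, the stabilizing constant `eps`, and the toy gradients are assumptions made for illustration; they are not taken from the original paper.

```python
import numpy as np


def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal AdaGrad step (illustrative sketch, not a reference implementation)."""
    accum = accum + grad ** 2            # per-coordinate running sum of squared gradients
    step = lr / (np.sqrt(accum) + eps)   # effective per-coordinate learning rate (eps is an assumed constant)
    return theta - step * grad, accum


# Toy setting (assumed): feature 0 yields a gradient at every step,
# feature 1 only once every ten steps, i.e. it is an infrequent feature.
theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)

# Effective learning rate each feature would receive on its next update.
effective_lr = 0.1 / (np.sqrt(accum) + 1e-8)
print(effective_lr)
```

In this toy run the infrequent feature accumulates a smaller sum of squared gradients, so its effective learning rate stays larger than that of the frequent feature.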

AdaGrad was introduced by Duchi et al.^[1] in a highly cited paper published in the Journal of Machine Learning Research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks), and it influenced the development of the Adam algorithm^[2].

Theory

Definitions

Traditional Sub-gradient Update

AdaGrad Update

Algorithm

Regret Bounds

Empirical Performance

Numerical Example

Applications

Summary and Discussion

References

[1] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).
[2] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
