AdaGrad: Difference between revisions

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Jump to navigation Jump to search
No edit summary
Line 12: Line 12:
== Applications ==
== Applications ==


== Empirical Performance ==
=== Empirical Performance ===


== Summary and Discussion ==
== Summary and Discussion ==


== References ==
== References ==

Revision as of 14:02, 26 November 2021

Author: Daniel Villarraga (SYSEN 6800 Fall 2021)

Introduction

AdaGrad is a family of sub-gradient algorithms for stochastic optimization. The algorithms belonging to that family are similar to second-order stochastic gradient descend with an approximation for the Hessian of the optimized function. Intuitively, it adapts the learning rate for each feature depending on the estimated geometry of the function; additionally, it tends to assign higher learning rates to infrequent features, which ensures that the updates are significant each time (independent of feature frequency).

AdaGrad was introduced by Duchi et al.[1] in a highly cited paper published in the Journal of machine learning research in 2011. It is arguably one of the most popular algorithms for machine learning, and it influenced the development of the Adam algorithm[2].

Theory

Numerical Example

Applications

Empirical Performance

Summary and Discussion

References

  1. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
  2. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.