Momentum: Difference between revisions

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Jump to navigation Jump to search
No edit summary
 
(29 intermediate revisions by 3 users not shown)
Line 1: Line 1:
Authors: Thomas Lee, Greta Gasswint, Elizabeth Henning (SYSEN5800 Fall 2021)
Authors: Thomas Lee, Greta Gasswint, and Elizabeth Henning (SYSEN5800 Fall 2021)


== Introduction ==
== Introduction ==
'''Momentum''' is an extension to the [https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent gradient descent] optimization algorithm that builds inertia in a search direction to overcome local minima and oscillation of noisy gradients<ref name=":0">https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/</ref>. It is based on the same concept of momentum in physics. A classic example is a ball rolling down a hill that gathers enough momentum to overcome a plateau region and make it to a global minima instead of getting stuck at a local minima. Momentum adds history to the parameter updates which significantly accelerates the optimization process. Momentum controls the amount of history to include in the update equation via a hyperparameter<ref name=":0" />. This hyperparameter is a value ranging from 0 to 1. A momentum of 0 is equivalent to gradient descent without momentum<ref name=":0" />. A higher momentum value means more gradients from the past (history) are considered<ref name=":1">https://towardsdatascience.com/gradient-descent-with-momentum-59420f626c8f</ref>.  
Momentum is an extension to the [https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent gradient descent] optimization algorithm that builds inertia in a search direction to overcome local minima and oscillation of noisy gradients.<ref name=":0">Brownlee, J. (2021, October 11). ''Gradient descent with momentum from scratch''. Machine Learning Mastery. https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/.</ref> It is based on the same concept of momentum in physics. A classical example of the concept is a ball rolling down a hill that gathers enough momentum to overcome a plateau region and make it to a global minima instead of getting stuck at a local minima. Momentum adds history to the parameter updates of descent problems which significantly accelerates the optimization process. The amount of history included in the update equation is determined via a hyperparameter.<ref name=":0" /> This hyperparameter is a value ranging from 0 to 1, where a momentum value of 0 is equivalent to gradient descent without momentum.<ref name=":0" /> A higher momentum value means more gradients from the past (history) are considered.<ref name=":1">Bhat, R. (2020, October 22). ''Gradient descent with momentum''. Medium. https://towardsdatascience.com/gradient-descent-with-momentum-59420f626c8f</ref>  
 
Gradient descent also has several limitations which can be addressed using momentum application, including issues with noise in the oscillations and potential failure to solve due to non-convex functions having multiple local minima.<ref name=":4">J. Sum, C. -S. Leung and K. Ho, "A Limitation of Gradient Descent Learning," in IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 6, pp. 2227-2232, June 2020, doi: 10.1109/TNNLS.2019.2927689.</ref><ref name=":1" /> Momentum addresses these limitations using exponentially weighted gradients, avoiding errors caused by iterations being equally weighted.
== Theory, methodology, and/or algorithmic discussions ==
== Theory, methodology, and/or algorithmic discussions ==
=== Definition ===
hi


=== Algorithm ===
=== Algorithm ===
The main idea behind momentum is to compute an exponentially weighted average of the gradients and use that to update the weights. By taking past gradients into account, the steps to gradient descent become smoothed out.   
The main idea behind momentum is to compute an exponentially weighted average of the gradients and use that to update the weights. By taking past gradients into account, the steps to gradient descent become smoothed out, which can reduce the amount of oscillations seen in iterations (where the iterative solutions are going between above and below the true curve). <ref>Kevin P. Murphy. ''Machine Learning : A Probabilistic Perspective''. The MIT Press; 2012. https://search-ebscohost-com.proxy.library.cornell.edu/login.aspx?direct=true&db=nlebk&AN=480968&site=ehost-live</ref>  


In gradient descent (stochastic) without momentum, the update rule at each iteration is given by:
In (stochastic) gradient descent without momentum, the update rule at each iteration is given by:


''W = W - <math>\alpha</math>dW''
<math>\theta_i=\theta_{i-1}-\gamma*g_i 
</math>


Where:
Where:
*''W'' denotes the parameters to the cost function
*''<math>\theta
*''dW'' is the gradient indicating which direction to decrease the cost function by
*''<math>\alpha</math>'' is the hyperparameter representing the learning rate
 


In gradient descent (stochastic) with momentum, the update rule at each iteration is given by:
</math>'' denotes the parameters to the cost function
*''<math>g_i 
</math>'' is the gradient indicating which direction to decrease the cost function by
*''<math>\gamma
</math>'' is the hyperparameter representing the learning rate<br />
In (stochastic) gradient descent with momentum, the update rule at each iteration is given by:


''V<math>dw</math> = βV<math>dw</math> + (1-β)dW''
<math>b_i=\mu*b_{i-1}+g_i 
</math>


''W = W -'' <math>\alpha</math>''V<math>dw</math>''
<math>\theta_i=\theta_{i-1}-\gamma*b_i 
</math>


Where:
Where:
* ''β'' is a new hyperparameter that denotes the momentum constant


*''<math>\theta


Momentum can be applied to other gradient descent variations such as batch gradient descent and mini-batch gradient descent. Regardless of the gradient descent variation, the momentum hyperparameter ''β'' will need to be tuned in addition to the learning rate hyper parameter <math>\alpha</math>''β'' can be any value between 0 and 1. Generally, the larger the momentum value, the more we take past gradients into account. This results in a smoother update. Consequently, choosing a ''β'' value that is too large will result in over smoothing. A momentum value of 0.9 tends to be a good default choice<ref name=":2" />. Finding the optimal ''β'' is a matter of trial and error to see what works best for your model such that the cost function is minimized.
</math>'' denotes the parameters to the cost function
*''<math>g_i 
</math>'' is the gradient indicating which direction to decrease the cost function by
*''<math>\gamma
</math>'' is the hyperparameter representing the learning rate
*<math>b_i
</math> is the modified step direction term (as opposed to just using ''<math>g_i 
</math>'') that incorporates momentum


=== Problems with Gradient Descent ===
*''<math>\mu
[[File:Gradient descent convex.png|thumb|A gradient descent scenario for a convex cost function.]]
</math>'' is a new hyperparameter that denotes the momentum constant
With gradient descent, a weight update at time ''t'' is given by the learning rate and gradient at that exact moment<ref name=":1" />. It doesn't not consider previous steps when searching.


This results in two issues:


# Unlike convex functions, a non-convex cost function can have many local minima's meaning the first local minima found is not guaranteed to be the global minima. At the local minima, the gradient of the cost function will be very small resulting in no weight updates. Because of this, gradient descent will get stuck and fail to find the most optimal solution.
Momentum can be applied to other gradient descent variations such as batch gradient descent and mini-batch gradient descent. Regardless of the gradient descent variation, the momentum hyperparameter ''<math>\mu
# Gradient descent can be noisy with many oscillations which results in a larger number of iterations needed to reach convergence.  
</math>'' must be tuned in addition to the learning rate hyperparameter ''<math>\gamma
</math>''.  ''<math>\mu
</math>'' can be any value between 0 and 1. Generally, the larger the momentum value, the more past gradients must be taken into account, resulting in a smoother update. However, choosing a ''<math>\mu
</math>''  value that is too large will result in over smoothing. It is also possible for a large ''<math>\mu
</math>''  value to overshoot the target. As the ''<math>\mu
</math>'' value gets closer to 0, momentum behaves similarly to steepest descent.<ref>Craven, M., & Page, D. (n.d.). ''Deep learning II: Momentum & Adaptive Step Size - UW–madison''. https://www.biostat.wisc.edu/~craven/cs760/lectures/DNNs-2.pdf</ref> A value of 0 is essentially gradient descent without momentum and a momentum value of 0.9 tends to be a good default choice.<ref name=":2" /> Finding the optimal ''<math>\mu
</math>''  is a matter of trial and error to determine the best choice for a particular model such that the cost function is minimized.  


=== Problems with Gradient Descent ===
With gradient descent, a weight update at time ''t'' is given by the learning rate and gradient at that exact moment.<ref name=":1" /> This means that the previous steps are not considered when searching for the next iteration's solution.
[[File:Gradient descent momentum.png|thumb|Gradient descent with and without momentum.<ref name=":3">Srihari, S. (n.d.). ''Basic Optimization Algorithms''. Deep learning.https://cedar.buffalo.edu/~srihari/CSE676/8.3%20BasicOptimizn.pdf</ref>]]
This results in two issues:


Momentum is able to solve both of these issues but using an exponentially weighted average of the gradients to update the weights at each iteration. This method also prevents gradients of previous iterations to be weight equally. With an exponential weighted average, recent gradients are given more weightage than previous gradients. This can be seen in the numerical example below.
# Unlike convex functions, a non-convex cost function can have many local minima's meaning the first local minima found is not guaranteed to be the global minima. At the local minima, the gradient of the cost function will be very small resulting in no weight updates. Because of this, gradient descent will get stuck and fail to find the global optimal solution.
# Gradient descent can be noisy with many oscillations which results in a larger number of iterations needed to reach convergence.<ref name=":4" /><br />
Momentum is able to solve both of these issues buy using an exponentially weighted average of the gradients to update the weights at each iteration. This method also prevents gradients of previous iterations to be weighted equally. With an exponential weighted average, recent gradients are given more weight than previous gradients.  


==== Another heading ====
The graph to the right shows a comparison of gradient descent with and without momentum. Using momentum, the oscillations are significantly reduced and convergence to the minimum is achieved in fewer iterations.<ref name=":3" /> 
hi


==== Another heading ====
== Numerical Example ==
To demonstrate the use of momentum in the context of gradient descent, minimize the following function:


hi
<math>y=0.3*x^4 - 0.1*x^3 - 2*x^2 - 0.8*x</math>
=== SGD without Momentum ===
[[File:Without momentum.png|alt=|thumb|402x402px|Stochastic gradient descent without momentum stops at a local minimum.]]Solving this objective function using stochastic gradient descent, starting at <math>\theta_0=-2.8 
</math> and a learning rate of <math>\gamma=0.05 
</math>, gives a final solution of <math>x=-1.59 
</math> and <math>y=-1.46 
</math>. This example uses python to construct a loss function and optimizer with PyTorch.<ref>PyTorch Tutorials. https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html</ref> The solver iterates through the following equations:


'''Graphical Explanation'''
<math>g_i = \nabla\theta_i f_i (\theta_{i-1}) = 1.2*\theta_{i-1}^3 - 0.3*\theta_{i-1}^2 - 4*\theta_{i-1} - 0.8
[[File:Gradient descent nonconvex.png|thumb|A gradient descent scenario, without momentum, for a non-convex cost function. ]]
 
hi
</math>


==== hi ====
<math>\theta_i=\theta_{i-1}-\gamma*g_i 
hi
</math>
{| class="wikitable"
|+
!Iteration
!<math>g_i 
</math>
!<math>\theta_i 
</math>
|-
|1
|  -18.294399999999
|  -1.885279999999999
|-
|2
|  -2.3661413363687
|  -1.766972933181563
|-
|3
|  -1.2889636339668
|  -1.702524751483223
|-
|4
| -0.78138469777962
| -1.663455516594242
|}
The graph to the right depicts how the stochastic gradient descent problem will stop at a local minimum instead of finding the global minimum of the function. The magenta line represents the objective function and the blue line shows the points at each iteration. The following table lists the solution values for <math>x 
</math> and <math>y 
</math> for each iteration.
{| class="wikitable"
|+
!
!x
!y
|-
|1
|<nowiki>-2.800</nowiki>
|7.194
|-
|2
|<nowiki>-1.885</nowiki>
|<nowiki>-1.140</nowiki>
|-
|3
|  -1.766
|  -1.354
|-
|4
|  -1.702
|  -1.421
|-
|5
| -1.663
| -1.446
|-
|
|...
|...
|-
|99
|  -1.586
|  -1.464
|}
=== SGD with Momentum ===
[[File:With momentum.png|alt=|thumb|402x402px|Stochastic gradient descent with momentum stops at the global minimum.]]By adding momentum, the solver is able to overcome the hill past the local minimum and settle in the global maximum. The following example uses a momentum value of <math>\mu=0.7 
</math> and the same parameters as the previous example, with learning rate of <math>\gamma=0.05 
</math> and starting point of <math>\theta_0=-2.8 
</math>. The solver iterates through the following equations:


'''another header'''
<math>g_i = \nabla\theta_i f_i (\theta_{i-1}) = 1.2*\theta_{i-1}^3 - 0.3*\theta_{i-1}^2 - 4*\theta_{i-1} - 0.8
 
</math>


* hi
<math>b_i=\mu*b_{i-1}+g_i 
</math>


=== another heading ===
<math>\theta_i=\theta_{i-1}-\gamma*b_i 
</math>


* Blah:
The values of <math>b_i 
 
</math> use momentum and information from previous iterations. For the first iteration, <math>b_0=0 
'''Some Example'''
</math>. Therefore, the first iteration will have the same solution as the example without momentum.
 
{| class="wikitable"
* More text
|+
 
!Iteration
== Numerical Example ==
!<math>g_i</math>
To demonstrate the use of momentum in the context of gradient descent, minimize the following function:
!<math>b_i</math>
!<math>\theta_i</math>
|-
|1
| -18.29439999999999
|0
| -1.88527999999999
|-
|2
| -2.366141336368740
|<nowiki>-15.1722213363687</nowiki>
| -1.12666893318156
|-
|3
|1.609651754221803
| -9.01090318123631
| -0.67612377411974
|-
|4
|1.396449498127179
| -9.22410543733093
| -0.21491850225320
|}


<math>y=0.3*x^4 - 0.1*x^3 - 2*x^2 - 0.8*x</math>


=== SGD without Momentum ===
Note that momentum values less than 0.7 are too low and are unable to overcome the hill past the local minimum. The following table lists the values for <math>x 
Solving this objective function using stochastic gradient descent, starting at x=-2.8 and a learning rate of 0.05, gives a solution of y=-1.46 and x=-1.59. This example uses python to construct a loss function and optimizer with PyTorch<ref>https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html</ref>. The following graph depicts how the stochastic gradient descent problem will stop at a local minimum instead of finding the global minimum of the function. The magenta line represents the objective function and the blue line shows the points at each iteration.
</math> and <math>y 
[[File:NoMomentum.jpg|thumb|Stochastic gradient descent without momentum stops at a local minimum.|alt=|none|372x372px]]
</math> for each iteration after starting at <math>\theta_0=-2.8 
 
</math>. Using momentum, this solver gives the correct solution for the global minimum at <math>x=2.04 
=== SGD with Momentum ===
</math> and <math>y=-5.61
By adding momentum, the solver is able to overcome the hill past the local minimum and settle in the global maximum. The following example uses a momentum value of 0.7 and the same parameters as the previous example, with learning rate of 0.05 and starting point of x=2.8. Note that momentum values less than 0.7 are too low and are unable to overcome the hill past the local minimum. Using momentum, this solver gives the correct solution for the global minimum at y=-5.61 and x=2.04, shown in the graph below. The following table lists the values for x and y for each iteration after starting at x=-2.8.
</math>, as shown in the graph.
{| class="wikitable"
{| class="wikitable"
|+
|+
Line 114: Line 230:
|<nowiki>-5.608</nowiki>
|<nowiki>-5.608</nowiki>
|}
|}
== Applications ==


[[File:Momentum.jpg|thumb|Stochastic gradient descent with momentum stops at the global minimum.|alt=|none|371x371px]]
Momentum is widely used in the machine learning community for optimizing non-convex functions such as deep neural networks.<ref name=":2">Defazio, A. ''Momentum via primal averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization''. https://arxiv.org/pdf/2010.00406.pdf</ref> Empirically, momentum methods outperform traditional stochastic gradient descent approaches.<ref name=":2" /> In deep learning, SGD is widely prevalent and is the underlying basis for many optimizers such as Adam, Adadelta, RMSProp, etc. which already utilize momentum to reduce computation speed.<ref>Rosebrock, A. (2021, May 7). ''Gradient descent algorithms and variations''. PyImageSearch. https://www.pyimagesearch.com/2021/05/05/gradient-descent-algorithms-and-variations/</ref> The momentum extension for optimization algorithms is available in many popular machine learning frameworks such as PyTorch, tensor flow, and scikit-learn. Generally, any problem that can be solved with stochastic gradient descent can benefit from the application of momentum. These are often unconstrained optimization problems. Some common SGD applications where momentum may be applied are ridge and logistic regression and support vector machines.<ref>Menon, A. (2009, February 27). ''Large-Scale Support Vector Machines: Algorithms and Theory''. https://cseweb.ucsd.edu//~akmenon/ResearchExamTalk.pdf</ref> Classification problems including those relating to cancer diagnosis<ref>Rehman, M. Z., & Nawi, N. M. (2011, June). The effect of adaptive momentum in improving the accuracy of gradient descent back propagation algorithm on classification problems. In ''International Conference on Software Engineering and Computer Systems'' (pp. 380-390). Springer, Berlin, Heidelberg.</ref> and image determination <ref>Zhang M., Smart W. (2004) Genetic Programming with Gradient Descent Search for Multiclass Object Classification. In: Keijzer M., O’Reilly UM., Lucas S., Costa E., Soule T. (eds) Genetic Programming. EuroGP 2004. Lecture Notes in Computer Science, vol 3003. Springer, Berlin, Heidelberg. <nowiki>https://doi.org/10.1007/978-3-540-24650-3_38</nowiki></ref> can also have reduced run times when momentum is implemented. In the case of medical diagnoses, this increased computation speed can directly benefit patients through faster diagnosis times and higher accuracy of diagnosis within the neural network. 
 
== Applications ==


Momentum is widely used in the machine learning community for training non-convex models such as deep neural networks<ref name=":2">https://arxiv.org/pdf/2010.00406.pdf</ref>. Empirically, momentum methods outperform traditional stochastic gradient descent approaches<ref name=":2" />. The momentum extension for optimization algorithms is available in many popular machine learning frameworks such as pytorch, tensor flow, and scikit-learn.
Momentum is also very useful in objective functions that have a large amount of curvature.<ref name=":0" /> These are areas where the gradients changes a lot. Since momentum leverages an exponential weighted average, smaller steps will be taken in these regions that have higher curvature resulting in dampened oscillations.  


In backpropgation,  
Overall, momentum may have farther reaching applicability to accelerating second order methods and deeper connections to other methods, such as the ellipsoid algorithm.<ref>Goh, G. (2017, April 4). ''Why momentum really works''. Distill. https://distill.pub/2017/momentum/</ref> 


== Conclusion ==
== Conclusion ==
hi
Momentum improves on gradient descent by reducing oscillatory effects and acting as an accelerator for optimization problem solving. Additionally, it finds the global (and not just local) optimum. Because of these advantages, momentum is commonly used in machine learning and has broad applications to all optimizers through SGD. Though the hyperparameters for momentum must be chosen with care and requires some trial and error, it ultimately addresses common issues seen in gradient descent problems. As deep learning continues to advance, momentum application will allow models and problems to be trained and solved faster than methods without. 


== References ==
== References ==

Latest revision as of 19:24, 15 December 2021

Authors: Thomas Lee, Greta Gasswint, and Elizabeth Henning (SYSEN5800 Fall 2021)

Introduction

Momentum is an extension to the gradient descent optimization algorithm that builds inertia in a search direction to overcome local minima and oscillation of noisy gradients.[1] It is based on the same concept of momentum in physics. A classical example of the concept is a ball rolling down a hill that gathers enough momentum to overcome a plateau region and make it to a global minima instead of getting stuck at a local minima. Momentum adds history to the parameter updates of descent problems which significantly accelerates the optimization process. The amount of history included in the update equation is determined via a hyperparameter.[1] This hyperparameter is a value ranging from 0 to 1, where a momentum value of 0 is equivalent to gradient descent without momentum.[1] A higher momentum value means more gradients from the past (history) are considered.[2]

Gradient descent also has several limitations which can be addressed using momentum application, including issues with noise in the oscillations and potential failure to solve due to non-convex functions having multiple local minima.[3][2] Momentum addresses these limitations using exponentially weighted gradients, avoiding errors caused by iterations being equally weighted.

Theory, methodology, and/or algorithmic discussions

Algorithm

The main idea behind momentum is to compute an exponentially weighted average of the gradients and use that to update the weights. By taking past gradients into account, the steps to gradient descent become smoothed out, which can reduce the amount of oscillations seen in iterations (where the iterative solutions are going between above and below the true curve). [4]

In (stochastic) gradient descent without momentum, the update rule at each iteration is given by:

Where:

  • denotes the parameters to the cost function
  • is the gradient indicating which direction to decrease the cost function by
  • is the hyperparameter representing the learning rate

In (stochastic) gradient descent with momentum, the update rule at each iteration is given by:

Where:

  • denotes the parameters to the cost function
  • is the gradient indicating which direction to decrease the cost function by
  • is the hyperparameter representing the learning rate
  • is the modified step direction term (as opposed to just using ) that incorporates momentum
  • is a new hyperparameter that denotes the momentum constant


Momentum can be applied to other gradient descent variations such as batch gradient descent and mini-batch gradient descent. Regardless of the gradient descent variation, the momentum hyperparameter must be tuned in addition to the learning rate hyperparameter . can be any value between 0 and 1. Generally, the larger the momentum value, the more past gradients must be taken into account, resulting in a smoother update. However, choosing a value that is too large will result in over smoothing. It is also possible for a large value to overshoot the target. As the value gets closer to 0, momentum behaves similarly to steepest descent.[5] A value of 0 is essentially gradient descent without momentum and a momentum value of 0.9 tends to be a good default choice.[6] Finding the optimal is a matter of trial and error to determine the best choice for a particular model such that the cost function is minimized.

Problems with Gradient Descent

With gradient descent, a weight update at time t is given by the learning rate and gradient at that exact moment.[2] This means that the previous steps are not considered when searching for the next iteration's solution.

Gradient descent with and without momentum.[7]

This results in two issues:

  1. Unlike convex functions, a non-convex cost function can have many local minima's meaning the first local minima found is not guaranteed to be the global minima. At the local minima, the gradient of the cost function will be very small resulting in no weight updates. Because of this, gradient descent will get stuck and fail to find the global optimal solution.
  2. Gradient descent can be noisy with many oscillations which results in a larger number of iterations needed to reach convergence.[3]

Momentum is able to solve both of these issues buy using an exponentially weighted average of the gradients to update the weights at each iteration. This method also prevents gradients of previous iterations to be weighted equally. With an exponential weighted average, recent gradients are given more weight than previous gradients.

The graph to the right shows a comparison of gradient descent with and without momentum. Using momentum, the oscillations are significantly reduced and convergence to the minimum is achieved in fewer iterations.[7]

Numerical Example

To demonstrate the use of momentum in the context of gradient descent, minimize the following function:

SGD without Momentum

Stochastic gradient descent without momentum stops at a local minimum.

Solving this objective function using stochastic gradient descent, starting at and a learning rate of , gives a final solution of and . This example uses python to construct a loss function and optimizer with PyTorch.[8] The solver iterates through the following equations:

Iteration
1 -18.294399999999 -1.885279999999999
2 -2.3661413363687 -1.766972933181563
3 -1.2889636339668 -1.702524751483223
4 -0.78138469777962 -1.663455516594242

The graph to the right depicts how the stochastic gradient descent problem will stop at a local minimum instead of finding the global minimum of the function. The magenta line represents the objective function and the blue line shows the points at each iteration. The following table lists the solution values for and for each iteration.

x y
1 -2.800 7.194
2 -1.885 -1.140
3 -1.766 -1.354
4 -1.702 -1.421
5 -1.663 -1.446
... ...
99 -1.586 -1.464

SGD with Momentum

Stochastic gradient descent with momentum stops at the global minimum.

By adding momentum, the solver is able to overcome the hill past the local minimum and settle in the global maximum. The following example uses a momentum value of and the same parameters as the previous example, with learning rate of and starting point of . The solver iterates through the following equations:

The values of use momentum and information from previous iterations. For the first iteration, . Therefore, the first iteration will have the same solution as the example without momentum.

Iteration
1 -18.29439999999999 0 -1.88527999999999
2 -2.366141336368740 -15.1722213363687 -1.12666893318156
3 1.609651754221803 -9.01090318123631 -0.67612377411974
4 1.396449498127179 -9.22410543733093 -0.21491850225320


Note that momentum values less than 0.7 are too low and are unable to overcome the hill past the local minimum. The following table lists the values for and for each iteration after starting at . Using momentum, this solver gives the correct solution for the global minimum at and , as shown in the graph.

x y
1 -2.800 7.194
2 -1.885 -1.140
3 -1.126 -1.011
4 -0.676 -0.279
... ...
99 2.042 -5.608

Applications

Momentum is widely used in the machine learning community for optimizing non-convex functions such as deep neural networks.[6] Empirically, momentum methods outperform traditional stochastic gradient descent approaches.[6] In deep learning, SGD is widely prevalent and is the underlying basis for many optimizers such as Adam, Adadelta, RMSProp, etc. which already utilize momentum to reduce computation speed.[9] The momentum extension for optimization algorithms is available in many popular machine learning frameworks such as PyTorch, tensor flow, and scikit-learn. Generally, any problem that can be solved with stochastic gradient descent can benefit from the application of momentum. These are often unconstrained optimization problems. Some common SGD applications where momentum may be applied are ridge and logistic regression and support vector machines.[10] Classification problems including those relating to cancer diagnosis[11] and image determination [12] can also have reduced run times when momentum is implemented. In the case of medical diagnoses, this increased computation speed can directly benefit patients through faster diagnosis times and higher accuracy of diagnosis within the neural network.

Momentum is also very useful in objective functions that have a large amount of curvature.[1] These are areas where the gradients changes a lot. Since momentum leverages an exponential weighted average, smaller steps will be taken in these regions that have higher curvature resulting in dampened oscillations.

Overall, momentum may have farther reaching applicability to accelerating second order methods and deeper connections to other methods, such as the ellipsoid algorithm.[13]

Conclusion

Momentum improves on gradient descent by reducing oscillatory effects and acting as an accelerator for optimization problem solving. Additionally, it finds the global (and not just local) optimum. Because of these advantages, momentum is commonly used in machine learning and has broad applications to all optimizers through SGD. Though the hyperparameters for momentum must be chosen with care and requires some trial and error, it ultimately addresses common issues seen in gradient descent problems. As deep learning continues to advance, momentum application will allow models and problems to be trained and solved faster than methods without.

References

  1. 1.0 1.1 1.2 1.3 Brownlee, J. (2021, October 11). Gradient descent with momentum from scratch. Machine Learning Mastery. https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/.
  2. 2.0 2.1 2.2 Bhat, R. (2020, October 22). Gradient descent with momentum. Medium. https://towardsdatascience.com/gradient-descent-with-momentum-59420f626c8f
  3. 3.0 3.1 J. Sum, C. -S. Leung and K. Ho, "A Limitation of Gradient Descent Learning," in IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 6, pp. 2227-2232, June 2020, doi: 10.1109/TNNLS.2019.2927689.
  4. Kevin P. Murphy. Machine Learning : A Probabilistic Perspective. The MIT Press; 2012. https://search-ebscohost-com.proxy.library.cornell.edu/login.aspx?direct=true&db=nlebk&AN=480968&site=ehost-live
  5. Craven, M., & Page, D. (n.d.). Deep learning II: Momentum & Adaptive Step Size - UW–madison. https://www.biostat.wisc.edu/~craven/cs760/lectures/DNNs-2.pdf
  6. 6.0 6.1 6.2 Defazio, A. Momentum via primal averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization. https://arxiv.org/pdf/2010.00406.pdf
  7. 7.0 7.1 Srihari, S. (n.d.). Basic Optimization Algorithms. Deep learning.https://cedar.buffalo.edu/~srihari/CSE676/8.3%20BasicOptimizn.pdf
  8. PyTorch Tutorials. https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html
  9. Rosebrock, A. (2021, May 7). Gradient descent algorithms and variations. PyImageSearch. https://www.pyimagesearch.com/2021/05/05/gradient-descent-algorithms-and-variations/
  10. Menon, A. (2009, February 27). Large-Scale Support Vector Machines: Algorithms and Theory. https://cseweb.ucsd.edu//~akmenon/ResearchExamTalk.pdf
  11. Rehman, M. Z., & Nawi, N. M. (2011, June). The effect of adaptive momentum in improving the accuracy of gradient descent back propagation algorithm on classification problems. In International Conference on Software Engineering and Computer Systems (pp. 380-390). Springer, Berlin, Heidelberg.
  12. Zhang M., Smart W. (2004) Genetic Programming with Gradient Descent Search for Multiclass Object Classification. In: Keijzer M., O’Reilly UM., Lucas S., Costa E., Soule T. (eds) Genetic Programming. EuroGP 2004. Lecture Notes in Computer Science, vol 3003. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24650-3_38
  13. Goh, G. (2017, April 4). Why momentum really works. Distill. https://distill.pub/2017/momentum/