Momentum

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Revision as of 22:34, 24 November 2021

Authors: Thomas Lee, Greta Gasswint, Elizabeth Henning (SYSEN5800 Fall 2021)

Introduction

Momentum is an extension of the gradient descent optimization algorithm that builds inertia in the search direction to overcome local minima and the oscillation caused by noisy gradients (1). It is based on the concept of momentum in physics: a classic example is a ball rolling down a hill that gathers enough momentum to carry it over a plateau region and on to a global minimum (2). Momentum adds history to the parameter updates, which significantly accelerates the optimization process. The amount of history included in the update equation is controlled by a hyperparameter with a value ranging from 0 to 1 (1). A momentum of 0 is equivalent to gradient descent without momentum (1), while a higher momentum value means more gradients from the past (the history) are considered (2).


(1) https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/

(2) https://towardsdatascience.com/gradient-descent-with-momentum-59420f626c8f

Theory, methodology, and/or algorithmic discussions

Definition

hi

Algorithm

The main idea behind momentum is to compute an exponential moving average of the gradients and use that to update the weights.

In (stochastic) gradient descent without momentum, the update rule at each iteration is given by the following (a short code sketch appears after the list of terms below):

W = W - αdW

Where:

  • W denotes the parameters of the cost function
  • dW is the gradient of the cost function with respect to W; subtracting it moves W in a direction that decreases the cost
  • α is the learning rate
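
The same update rule can be written in a few lines of code. The following is a minimal sketch (not part of the original article) that assumes a hypothetical quadratic cost function f(W) = ΣW², so the gradient dW is easy to compute; W, dW, and α map directly to the symbols above.

  import numpy as np

  # Hypothetical example cost: f(W) = sum(W^2), so its gradient is dW = 2W.
  def grad(W):
      return 2 * W

  alpha = 0.1                 # learning rate
  W = np.array([5.0, -3.0])   # initial parameters

  for _ in range(100):
      dW = grad(W)            # gradient at the current W
      W = W - alpha * dW      # update rule: W = W - alpha*dW

  print(W)                    # W approaches the minimizer (the origin for this cost)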


In (stochastic) gradient descent with momentum, the update rule at each iteration is given by the following (again, a short code sketch appears after the list of terms below):

VdW = βVdW + (1-β)dW

W = W - αVdW

Where:

  • VdW is the exponential moving average of the gradients (typically initialized to zero at the start of training)
  • β is a new hyperparameter that denotes the momentum constant
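
Unrolling the first equation shows why this is an exponential moving average: VdW = (1-β)(dW_t + β·dW_(t-1) + β²·dW_(t-2) + …), so older gradients are discounted by powers of β. Continuing the sketch above (again an illustration, not part of the original article, using the same hypothetical quadratic cost), the momentum version keeps the running average VdW and uses it in place of the raw gradient; setting β = 0 recovers plain gradient descent, as noted in the introduction.

  import numpy as np

  # Same hypothetical quadratic cost as above: gradient dW = 2W.
  def grad(W):
      return 2 * W

  alpha = 0.1                 # learning rate
  beta = 0.9                  # momentum constant (beta = 0 gives plain gradient descent)
  W = np.array([5.0, -3.0])   # initial parameters
  VdW = np.zeros_like(W)      # exponential moving average of the gradients, starts at 0

  for _ in range(100):
      dW = grad(W)
      VdW = beta * VdW + (1 - beta) * dW   # VdW = beta*VdW + (1-beta)*dW
      W = W - alpha * VdW                  # W = W - alpha*VdW

  print(W)                    # W approaches the minimizer, with smoother steps than raw gradients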

Some heading

hi

Another heading

hi

Another heading

hi

Graphical Explanation

hi

hi

hi

another header

  • hi

another heading

  • Blah:

Some Example

  • More text

Numerical Example

Some header

hi

Applications

Some example

  • An example of this is

Conclusion

hi

References