RMSProp

Author: Jason Huang (SysEn 6800 Fall 2020)
Steward: Allen Yang, Fengqi You


== Introduction ==
RMSProp, or root mean square propagation, is an optimization algorithm/method designed for Artificial Neural Network (ANN) training. It is an unpublished algorithm, first proposed in lecture six of the Coursera course [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning”] by Geoff Hinton.<sup>[9]</sup> RMSProp lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years: it is an extension of the Stochastic Gradient Descent (SGD) algorithm and the momentum method, and it is one of the foundations of the Adam algorithm. One of its applications is as a stochastic technique for mini-batch gradient descent.


==Theory and Methodology==


=== Perceptron and Neural Networks ===
The perceptron is an algorithm used for supervised learning of binary classifiers, and it can also be regarded as a simplified, single-layer version of the Artificial Neural Network (ANN), which makes it a useful starting point for understanding neural networks. The function of a neural network in Artificial Intelligence (AI) is to imitate the human brain and its conscious center, modeling the small-unit behavior of the nervous system during human thinking. The basic form of the perceptron consists of inputs, weights, a bias, a net sum, and an activation function.
[[File:Screen Shot 2020-12-14 at 01.09.28.png|thumb|Basic form of the perceptron ]]




The perceptron computation starts by taking the input values <math>x_{1},x_{2} </math> and multiplying them by their weights <math>w_{1}, w_{2} </math>. The weighted inputs are then added together to create the weighted sum <math> \sum_i  w_{i} x_{i} </math>, and the weighted sum is applied to the activation function <math>f  </math> to produce the perceptron's output.
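As a small illustration (not part of the original text), the perceptron computation can be sketched in a few lines of Python; the input values, weights, bias, and the choice of a unit-step activation here are assumptions for the example only:

<syntaxhighlight lang="python">
# Minimal sketch of a perceptron: weighted sum of the inputs plus a bias,
# passed through an activation function (a unit step is assumed here).
def perceptron(inputs, weights, bias):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0   # activation function f

# Example with two inputs and arbitrary illustrative weights/bias.
print(perceptron([1.0, 0.5], weights=[0.4, -0.2], bias=0.1))   # prints 1
</syntaxhighlight>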




A neural network works similarly to the human brain’s neural network. A “neuron” in a neural network is a mathematical function that collects and classifies information according to a specific architecture. A neural network contains layers of interconnected nodes, each of which can be regarded as a perceptron and is similar to a multiple linear regression. The perceptron feeds the signal produced by a multiple linear regression into an activation function, which may be nonlinear.


=== '''RProp''' ===
RProp, or resilient backpropagation, is a widely used algorithm for supervised learning with multi-layered feed-forward networks. The basic concept of the backpropagation learning algorithm is the repeated application of the chain rule to compute the influence of each weight in the network with respect to an arbitrary error function <math>E</math>. The partial derivative of the error function can be represented as:
<math> \frac{\partial E}{\partial w_{ij}} =  \frac{\partial E}{\partial s_{i}} \frac{\partial s_{i}}{\partial net_{i}} \frac{\partial net_{i}}{\partial w_{ij}}</math>


where <math>w_{ij}</math> is the weight from neuron <math>j</math> to neuron <math>i</math>, <math>s_{i}</math> is the output, and <math>net_{i}</math> is the weighted sum of the inputs of neuron <math>i</math>. Once the partial derivative with respect to each weight is known, the error function can be minimized by performing simple gradient descent:


<math>w_{ij}(t+1) = w_{ij}(t) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)</math>


The choice of the learning rate <math>\epsilon</math>, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate will possibly lead to oscillation, preventing the error from falling below a certain value<sup>[7]</sup>.


In addition, RProp can be combined with the momentum method to prevent the above problem and to accelerate convergence; the update can be rewritten as:


<math> \Delta w_{ij}(t) = \epsilon \frac{\partial E}{\partial w_{ij}}(t) + \mu \Delta w_{ij}(t-1) </math>


However, it turns out that the optimal value of the momentum parameter <math>\mu</math> in the above equation is just as problem-dependent as the learning rate <math>\epsilon</math>, and no general improvement can be accomplished. In addition, the RProp algorithm does not work well when we have very large datasets and need to perform mini-batch weight updates. Therefore, a new algorithm, RMSProp, was proposed to cover more scenarios than RProp.
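For illustration, the gradient-descent update with momentum described above can be sketched as follows; this is a minimal sketch, not part of the original text, and the one-dimensional error function, starting point, and parameter values are assumptions:

<syntaxhighlight lang="python">
# Gradient descent with momentum: delta_w(t) = eps * dE/dw(t) + mu * delta_w(t-1)
def momentum_descent(grad_fn, w0, eps=0.01, mu=0.9, steps=200):
    w, delta = float(w0), 0.0
    for _ in range(steps):
        delta = eps * grad_fn(w) + mu * delta   # momentum-smoothed step
        w -= delta                              # move against the gradient
    return w

# Example: E(w) = w^2, so dE/dw = 2w; the iterates approach the minimum at 0.
print(momentum_descent(lambda w: 2.0 * w, w0=5.0))
</syntaxhighlight>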


=== '''RMSProp''' ===
The reason the RProp algorithm does not work for mini-batches is that it violates the central idea behind stochastic gradient descent: when the learning rate is small enough, the gradients are effectively averaged over successive mini-batches. For example, consider a weight that receives a gradient of 0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch. We would like those gradients to roughly cancel each other out so that the weight stays approximately the same, and RMSProp is constructed to achieve this.


By using the sign of the gradient from the RProp algorithm together with the efficiency of mini-batches, and by averaging over mini-batches so that gradients are combined in the right way, RMSProp keeps a moving average of the squared gradients for each weight and then divides the gradient by the square root of this mean square.


The update equations can be written as:
 
<math>E[g^2](t) =  \beta E[g^2](t-1) + (1- \beta) (\frac{\partial c}{\partial w})^2</math>
 
<math>w_{ij}(t) = w_{ij}(t-1) - \frac{ \eta }{ \sqrt{E[g^2](t)}} \frac{\partial c}{\partial w_{ij}} </math>


where <math>E[g^2] </math> is the moving average of squared gradients, <math> \partial c / \partial w </math> is the gradient of the cost function with respect to the weight, <math>\eta </math> is the learning rate, and <math>\beta</math> is the moving-average parameter. The default value of <math>\beta</math> is 0.9, chosen so that, for example, gradients of 0.1 on nine mini-batches and -0.9 on the tenth approximately cancel, and the default value of <math>\eta </math> is 0.001, based on experience.
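The update rule above can be sketched directly in code. The following is a minimal illustration rather than a reference implementation: the function names, the small stability constant <code>eps</code> added inside the square root (commonly used in practice to avoid division by zero), and the example cost function are assumptions:

<syntaxhighlight lang="python">
import numpy as np

def rmsprop(grad_fn, w0, eta=0.001, beta=0.9, eps=1e-8, steps=1000):
    """Sketch of RMSProp: keep a moving average of squared gradients
    and divide the gradient step by its square root, element-wise."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)                        # E[g^2], moving average
    for _ in range(steps):
        g = grad_fn(w)                            # gradient of the cost at w
        Eg2 = beta * Eg2 + (1.0 - beta) * g**2    # update the moving average
        w = w - eta / np.sqrt(Eg2 + eps) * g      # scaled gradient step
    return w

# Example: minimize f(w) = w1^2 + 10*w2^2 from an arbitrary starting point;
# the result is close to the minimizer at the origin.
print(rmsprop(lambda w: np.array([2 * w[0], 20 * w[1]]), w0=[3.0, -2.0], eta=0.01))
</syntaxhighlight>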


==Numerical Example==
For the simple unconstrained optimization problem <math>\min f(x) = 0.1x_{1}^2 +2x_{2}^2 </math>:


Set <math>\beta</math> = 0.9 and <math>\eta </math> = 0.4, and transform the optimization problem into the standard RMSProp form. The update equations are presented below:
[[File:Trajectory.png|alt=the visualization of the trajectory of RMSProp algorithm|thumb|The visualization of the trajectory of RMSProp algorithm]]
<math>\frac{\partial c_{1}}{\partial w_{1}} = 0.2x_{1}, \qquad \frac{\partial c_{2}}{\partial w_{2}} = 4x_{2}</math>


<math>E_{1}(t) = 0.9 E_{1}(t-1) + (1 - 0.9)(\frac{\partial c_{1}}{\partial w_{1}})^2</math>  


<math>E_{2}(t) = 0.9 E_{2}(t-1) + (1 - 0.9)(\frac{\partial c_{2}}{\partial w_{2}})^2</math>


<math>w_{1}(t) = w_{1}(t-1) - \frac{0.4}{ \sqrt{E_{1}}} \frac{\partial c_{1}}{\partial w_{1}}</math>  


<math>w_{2}(t) = w_{2}(t-1) - \frac{0.4}{ \sqrt{E_{2}}} \frac{\partial c_{2}}{\partial w_{2}}</math>


Using a programming language to solve the optimization problem and visualize the trajectory of the RMSProp algorithm, we can observe that the iterates converge to a certain point. For this particular problem, the minimum value of <math>0 </math> is obtained at <math>(x_{1}, x_{2}) = (0, 0) </math>.  [[File:1 - 2dKCQHh - Long Valley.gif|thumb|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]
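A short script reproducing this numerical example is given below. It is a sketch under the stated settings (<math>\beta</math> = 0.9, <math>\eta</math> = 0.4); the starting point of (-5, -2) and the small stability constant inside the square root are assumptions for the illustration:

<syntaxhighlight lang="python">
import math

beta, eta, eps = 0.9, 0.4, 1e-8
x1, x2 = -5.0, -2.0          # assumed starting point for illustration
E1, E2 = 0.0, 0.0            # moving averages of the squared gradients

for t in range(20):
    g1, g2 = 0.2 * x1, 4.0 * x2                  # gradients of f
    E1 = beta * E1 + (1.0 - beta) * g1**2
    E2 = beta * E2 + (1.0 - beta) * g2**2
    x1 -= eta / math.sqrt(E1 + eps) * g1
    x2 -= eta / math.sqrt(E2 + eps) * g2

print(x1, x2)   # both coordinates end up close to the minimizer (0, 0)
</syntaxhighlight>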


== Applications and Discussion ==
[[File:2 - pD0hWu5 - Beale's function.gif|thumb|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]
The applications of RMSProp concentrate on optimization with complex functions, such as neural networks, and on non-convex optimization problems with adaptive learning rates; it is also widely used for stochastic problems. The RMSProp optimizer restricts oscillations in the vertical direction, so the learning rate can be increased and the algorithm can take larger steps in the horizontal direction, converging faster than the comparable approach of gradient descent combined with the momentum method.
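In practice, RMSProp is usually invoked through a library rather than hand-coded. The sketch below, assuming a TensorFlow/Keras installation, shows the optimizer from reference [10] applied to a toy regression model; the architecture and data are placeholders, and <code>learning_rate</code> and <code>rho</code> correspond to <math>\eta</math> and <math>\beta</math> above:

<syntaxhighlight lang="python">
import numpy as np
from tensorflow import keras   # assumes TensorFlow/Keras is installed

# Toy single-layer regression model (placeholder architecture).
model = keras.Sequential([keras.Input(shape=(2,)), keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9),
              loss="mse")

# Placeholder data: y is a linear function of the two inputs.
X = np.random.rand(64, 2)
y = 0.1 * X[:, 0] + 2.0 * X[:, 1]
model.fit(X, y, epochs=5, verbose=0)
</syntaxhighlight>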


In the first visualization scheme, the gradient-based optimization algorithms have different convergence rates. As the visualizations show, algorithms that do not scale their steps based on gradient information have difficulty breaking the symmetry and converging rapidly. RMSProp has a relatively higher convergence rate than SGD, Momentum, and NAG, beginning its descent faster, but it is slower than Ada-grad and Ada-delta, which are closely related to the Adam algorithm. In conclusion, when handling problems with large scales/gradients, algorithms that scale the gradients/step sizes, such as Ada-delta, Ada-grad, and RMSProp, perform better and with higher stability.


Ada-grad is an adaptive learning rate algorithm that looks a lot like RMSProp. Ada-grad adds element-wise scaling of the gradient based on the historical sum of squares in each dimension: it keeps a running sum of squared gradients and adapts the learning rate by dividing it by the square root of this sum. Considering that the concepts behind RMSProp are widely used in other machine learning algorithms, it has high potential to be coupled with other methods, such as momentum.
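The difference between the two accumulators can be made concrete with a small sketch (illustrative only; the variable names and the stability constant are assumptions): Ada-grad divides by the square root of a running sum of squared gradients, while RMSProp divides by the square root of a decaying moving average.

<syntaxhighlight lang="python">
import numpy as np

def adagrad_step(w, g, acc, eta=0.01, eps=1e-8):
    acc = acc + g**2                         # running *sum* of squared gradients
    return w - eta / np.sqrt(acc + eps) * g, acc

def rmsprop_step(w, g, acc, eta=0.01, beta=0.9, eps=1e-8):
    acc = beta * acc + (1.0 - beta) * g**2   # decaying *average* of squared gradients
    return w - eta / np.sqrt(acc + eps) * g, acc
</syntaxhighlight>

Because Ada-grad's accumulator only grows, its effective step size keeps shrinking over training, whereas RMSProp's decaying average lets the step size adapt to the most recent gradients.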


== Conclusion ==
RMSProp, root mean square propagation, is a machine learning optimization algorithm for training Artificial Neural Networks (ANNs) with an adaptive learning rate; it is derived from the concepts of gradient descent and RProp. By combining averaging over mini-batches, mini-batch efficiency, and the use of gradients over successive mini-batches, RMSProp can reach a faster convergence rate than the original optimizers, though lower than more advanced optimizers such as Adam. Given the high performance of RMSProp and the possibility of combining it with other algorithms, harder problems may be better modeled and solved in the future.


==Reference==


1. A. Radford, "[https://imgur.com/a/Hqolp#NKsFHJb Visualizing Optimization Algos (open source)]".


2. R. Yamashita, M. Nishio, and R. K. Gian, "Convolutional neural networks: an overview and application in radiology", pp. 9:611–629, 2018.[[File:3 - NKsFHJb - Saddle Point.gif|thumb|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]

3. V. Bushave, "Understanding RMSprop — faster neural network learning", 2018.


4. V. Bushave, "How do we ‘train’ neural networks ?", 2017.


5. S. Ruder, "An overview of gradient descent optimization algorithms", 2016.


6. R. Maksutov, "Deep study of a not very deep neural network. Part 3a: Optimizers overview", 2018.


7. M. Riedmiller and H. Braun, "A Direct Adaptive Method for Faster Back-propagation Learning: The RPROP Algorithm", pp. 586-591, 1993.


8. D. Garcia-Gasulla, "An Out-of-the-box Full-network Embedding for Convolutional Neural Networks", pp. 168-175, 2018.


9. [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Geoffrey Hinton, "Coursera Neural Networks for Machine Learning lecture 6", 2018.]


10. [https://www.programcreek.com/python/example/104283/keras.optimizers.RMSprop Python keras.optimizers.RMSprop() Examples.]


11. [https://d2l.ai/chapter_optimization/rmsprop.html RMSProp Algorithm Implementation Example.]


12. S. De, A. Mukherjee, and E. Ullah, "Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration", conference paper at ICLR, 2019.
