RMSProp

Revision as of 09:55, 19 November 2020

Author: Jason Huang (SysEn 6800 Fall 2020)

Steward: Allen Yang, Fengqi You

Introduction

RMSProp, short for root mean square propagation, is an optimization algorithm for training Artificial Neural Networks (ANNs) in machine learning. It is a more recent development than Stochastic Gradient Descent (SGD) and the momentum method, and it is one of the foundations on which the Adam algorithm was built. RMSProp uses an adaptive learning rate and was never formally published: it was first proposed by Geoff Hinton in lecture six of the Coursera course “Neural Networks for Machine Learning”. Remarkably, this informally presented, unpublished algorithm is now very widely used.

Theory and Methodology

Artificial Neural Network

An Artificial Neural Network can be regarded as the brain and decision-making center of Artificial Intelligence (AI), imitating the way the human mind works when thinking. Scientists try to design ANNs so that their artificial neurons resemble the biological neurons that inspired them.

A single neuron presented as a mathematical function

The function of a neuron can be written as:

$f(x_{1},x_{2})=\max(0,w_{1}x_{1}+w_{2}x_{2})$

where $x_{1},x_{2}$ are two input numbers, $w_{1},w_{2}$ are the corresponding weights, and the function $f(x_{1},x_{2})$ takes these inputs and produces a single number as output. If $w_{1}x_{1}+w_{2}x_{2}$ is greater than 0, the function returns this positive value; otherwise it returns 0. A neural network can therefore be represented as a composition of such mathematical functions, in which the output of one function is used as the input of the next.
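As a minimal sketch of the neuron function above, the following Python snippet evaluates $\max(0,w_{1}x_{1}+w_{2}x_{2})$ and feeds the outputs of two neurons into a third. The function name `neuron` and all weights and inputs are illustrative assumptions, not values from the article.

```python
def neuron(x1, x2, w1, w2):
    """Single neuron: returns max(0, w1*x1 + w2*x2)."""
    return max(0.0, w1 * x1 + w2 * x2)


# Illustrative inputs and weights (assumed, not from the article).
positive = neuron(1.0, 2.0, w1=0.5, w2=0.25)    # weighted sum is 1.0, so it passes through
negative = neuron(1.0, 2.0, w1=-0.5, w2=-0.25)  # weighted sum is -1.0, so it is clipped to 0

# Chaining: the outputs of the two neurons above are the inputs of the next one.
chained = neuron(positive, negative, w1=2.0, w2=1.0)
```

Here `positive` is 1.0, `negative` is 0.0, and `chained` is 2.0, which illustrates how a network is built by composing such functions.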

RProp

RProp, or Resilient Backpropagation, was a widely used algorithm for supervised learning with multi-layered feed-forward networks, and its concepts are the foundation on which RMSProp was developed. The derivative of the error function with respect to a weight can be written as:

${\frac {\partial E}{\partial w_{ij}}}={\frac {\partial E}{\partial s_{i}}}{\frac {\partial s_{i}}{\partial net_{i}}}{\frac {\partial net_{i}}{\partial w_{ij}}}$

where $w_{ij}$ is the weight from neuron $j$ to neuron $i$, $s_{i}$ is the output of neuron $i$, and $net_{i}$ is the weighted sum of the inputs of neuron $i$. Once the partial derivative for each weight is known, the error function can be minimized by performing simple gradient descent:

$w_{ij}(t+1)=w_{ij}(t)-\epsilon {\frac {\partial E}{\partial w_{ij}}}(t)$

(reference required)

The choice of the learning rate $\epsilon$, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; a large learning rate, on the other hand, may lead to oscillation, preventing the error from falling below a certain value.
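The effect of the learning rate can be sketched on a toy problem. The snippet below applies the update $w(t+1)=w(t)-\epsilon \, \partial E/\partial w$ to the assumed error function $E(w)=w^{2}$, which is chosen only for illustration; the function name and the three learning rates are hypothetical choices.

```python
def gradient_descent(w0, lr, steps):
    """Minimise the toy error E(w) = w**2 with w <- w - lr * dE/dw."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w      # dE/dw for E(w) = w**2
        w = w - lr * grad   # plain gradient descent step
        history.append(w)
    return history


slow = gradient_descent(1.0, lr=0.01, steps=20)         # too small: barely moves
good = gradient_descent(1.0, lr=0.4, steps=20)          # converges quickly
oscillating = gradient_descent(1.0, lr=1.0, steps=20)   # too large: jumps between +1 and -1
```

With the small rate the weight is still far from the minimum after 20 steps, while with the large rate it flips sign on every step and the error never decreases.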

To mitigate this problem and accelerate convergence, the gradient descent update can be combined with the momentum method, in which a momentum parameter $\mu$ scales the influence of the previous step on the current one. The update can be rewritten as:

$\Delta w_{ij}(t)=-\epsilon {\frac {\partial E}{\partial w_{ij}}}(t)+\mu \Delta w_{ij}(t-1)$

However, it turns out that the optimal value of the momentum parameter $\mu$ is just as problem dependent as the learning rate $\epsilon$, so no general improvement can be achieved. Moreover, the RProp algorithm does not work well when the dataset is very large and the weights must be updated on mini-batches.
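A minimal sketch of the momentum update, assuming the form $\Delta w(t)=-\epsilon \, \partial E/\partial w(t)+\mu \Delta w(t-1)$ and the toy error function $E(w)=w^{2}$. The function name and the values of $\epsilon$ and $\mu$ are illustrative assumptions only.

```python
def descend(w0, lr, mu, steps):
    """Gradient descent with momentum on the toy error E(w) = w**2."""
    w = w0
    delta = 0.0                           # previous weight change, delta_w(t-1)
    for _ in range(steps):
        grad = 2.0 * w                    # dE/dw for E(w) = w**2
        delta = -lr * grad + mu * delta   # momentum update
        w = w + delta
    return w


plain = descend(1.0, lr=0.01, mu=0.0, steps=50)          # mu = 0 is plain gradient descent
with_momentum = descend(1.0, lr=0.01, mu=0.9, steps=50)  # same learning rate, with momentum
```

In this toy setting the run with momentum ends closer to the minimum after the same number of steps. That outcome depends on the chosen $\epsilon$ and $\mu$, which is the problem dependence noted above.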

RMSProp

The reason RProp does not work with mini-batches is that it violates the central idea behind stochastic gradient descent: when the learning rate is small enough, the gradient is effectively averaged over successive mini-batches. Consider a weight that gets a gradient of 0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch. We would like those gradients to roughly cancel each other out, so that the weight stays approximately the same. RProp, which uses only the sign of the gradient, instead moves the weight nine times in one direction and only once in the other by a similar amount, so the weight drifts. RMSProp was designed to solve this issue.

RMSProp combines the robustness of RProp, which uses only the sign of the gradient, with the efficiency of mini-batches and an averaging over mini-batches that allows gradients to be combined in the right way. To do so, RMSProp keeps a moving average of the squared gradient for each weight, and then divides the gradient by the square root of this mean square.
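The mini-batch example can be checked numerically. The sketch below compares a plain SGD update with a simplified sign-only update. It assumes a fixed step size, whereas full RProp also adapts the step size per weight, so it illustrates the sign issue only; the function name and step values are hypothetical.

```python
def apply_updates(gradients, lr=0.01, step=0.01):
    """Apply the same gradient sequence with SGD and with a sign-only rule."""
    w_sgd = 0.0
    w_sign = 0.0
    for g in gradients:
        w_sgd -= lr * g                      # SGD: step proportional to the gradient
        sign = 1 if g > 0 else -1 if g < 0 else 0
        w_sign -= step * sign                # sign-only: fixed step, magnitude ignored
    return w_sgd, w_sign


# Nine mini-batches with gradient 0.1, then one with gradient -0.9.
gradients = [0.1] * 9 + [-0.9]
w_sgd, w_sign = apply_updates(gradients)
```

The SGD weight ends essentially where it started because the gradients sum to zero, while the sign-only weight has moved by eight net steps.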

The update equations are:

$E[g^{2}](t)=\beta E[g^{2}](t-1)+(1-\beta )\left({\frac {\partial c}{\partial w_{ij}}}\right)^{2}$

$w_{ij}(t)=w_{ij}(t-1)-{\frac {\eta }{\sqrt {E[g^{2}](t)}}}{\frac {\partial c}{\partial w_{ij}}}$

where $E[g^{2}]$ is the moving average of squared gradients, $\partial c/\partial w_{ij}$ is the gradient of the cost function with respect to the weight, $\eta$ is the learning rate and $\beta$ is the moving average parameter (default value 0.9).

These equations adapt the learning rate of each weight by dividing it by the square root of the mean squared gradient. Since only an estimate of the gradient on the current mini-batch is available, a moving average of the squared gradient is used instead, with a decay parameter $\beta$ whose default value is 0.9.
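A minimal sketch of one RMSProp update for a single weight, following the moving-average and weight-update equations of this section. The small constant `eps` added to the denominator is an assumption made here to avoid division by zero and does not appear in the equations; the function name, the toy cost $c(w)=w^{2}$ and the learning rate are also illustrative.

```python
import math


def rmsprop_step(w, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new weight and the new moving average."""
    # E[g^2](t) = beta * E[g^2](t-1) + (1 - beta) * grad^2
    avg_sq = beta * avg_sq + (1.0 - beta) * grad ** 2
    # w(t) = w(t-1) - lr / sqrt(E[g^2](t)) * grad   (eps is an added safeguard)
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq


# Toy cost c(w) = w**2 with gradient 2*w (assumed for illustration).
w, avg_sq = 1.0, 0.0
first_w, _ = rmsprop_step(w, 2.0 * w, avg_sq)   # weight after a single step
for _ in range(500):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
```

Because the gradient is divided by its own running root mean square, the size of each step depends mainly on the learning rate rather than on the raw magnitude of the gradient.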

Numerical Example

2D RMSProp Example

For a 2-dimensional illustration of how RMSProp works, refer to the "Dive into Deep Learning" website, which presents an implementation for the function $f(x)=0.1x_{1}^{2}+2x_{2}^{2}$.
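A hedged sketch of a 2D run, assuming the quadratic $f(x)=0.1x_{1}^{2}+2x_{2}^{2}$. The starting point, learning rate, decay parameter, number of steps and the `eps` safeguard are illustrative assumptions, and this is not the referenced implementation itself.

```python
import math


def f(x1, x2):
    return 0.1 * x1 ** 2 + 2.0 * x2 ** 2


def rmsprop_2d(x1, x2, lr=0.4, beta=0.9, eps=1e-6, steps=20):
    """Run RMSProp on f, keeping one moving average per coordinate."""
    s1 = s2 = 0.0
    path = [(x1, x2)]
    for _ in range(steps):
        g1, g2 = 0.2 * x1, 4.0 * x2              # partial derivatives of f
        s1 = beta * s1 + (1.0 - beta) * g1 ** 2  # moving average for x1
        s2 = beta * s2 + (1.0 - beta) * g2 ** 2  # moving average for x2
        x1 -= lr * g1 / (math.sqrt(s1) + eps)
        x2 -= lr * g2 / (math.sqrt(s2) + eps)
        path.append((x1, x2))
    return path


path = rmsprop_2d(-5.0, -2.0)
final_x1, final_x2 = path[-1]
```

The gradient in the $x_{2}$ direction is much steeper than in the $x_{1}$ direction, but each coordinate is scaled by its own moving average, so both approach the minimum at the origin with a single learning rate.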

For a Python package that supports RMSProp, refer to "Python keras.optimizers.RMSprop() Examples", which covers both 3D and 2D implementations.

Applications and Discussion

In the first visualization scheme, gradient-based optimization algorithms show different convergence rates. As the visualizations show, algorithms that do not scale their updates based on gradient information find it hard to break the symmetry and converge rapidly. RMSProp converges faster than SGD, Momentum and NAG and begins its descent sooner, but it is slower than Ada-grad and Ada-delta, which are also adaptive learning rate algorithms. In conclusion, for problems with large or badly scaled gradients, algorithms that scale the gradients or step sizes, such as Ada-delta, Ada-grad and RMSProp, perform better and with higher stability.

Ada-grad is an adaptive learning rate algorithm that looks a lot like RMSProp. It adds element-wise scaling of the gradient based on the historical sum of squares in each dimension: a running sum of squared gradients is kept, and the learning rate is adapted by dividing it by the square root of that sum. The concepts in RMSProp are widely used in other machine learning algorithms, so it has high potential to be coupled with other methods, such as Momentum, and may achieve even more efficient performance with further work.
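The difference between the two accumulators can be sketched as follows. The snippet feeds the same constant gradient to an Ada-grad style running sum and to an RMSProp style moving average, then compares the resulting effective step sizes. The function name, the base learning rate and the `eps` safeguard are assumptions for illustration.

```python
import math


def effective_step_sizes(grads, lr=0.1, beta=0.9, eps=1e-8):
    """Return the effective learning rates after processing all gradients."""
    sum_sq = 0.0   # Ada-grad: running sum of squared gradients (never shrinks)
    avg_sq = 0.0   # RMSProp: moving average of squared gradients
    for g in grads:
        sum_sq += g ** 2
        avg_sq = beta * avg_sq + (1.0 - beta) * g ** 2
    adagrad_step = lr / (math.sqrt(sum_sq) + eps)
    rmsprop_step = lr / (math.sqrt(avg_sq) + eps)
    return adagrad_step, rmsprop_step


# 100 mini-batches, each with a gradient of 1.0.
adagrad_step, rmsprop_step = effective_step_sizes([1.0] * 100)
```

With a constant gradient the running sum keeps growing, so the Ada-grad step shrinks to about a tenth of the base rate after 100 batches, while the RMSProp step stays close to the base rate.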

References

1. Visualizing Optimization Algos

2. R Yamashita, M Nishio, R Kinh Gian, Convolutional neural networks: an overview and application in radiology (2018), 9:611–629

3. Vitaly Bushave, Understanding RMSprop — faster neural network learning (2018)

4. Vitaly Bushave, How do we ‘train’ neural networks ? (2017)

5. Sebastian Ruder, An overview of gradient descent optimization algorithms (2016)

6. Rinat Maksutov, Deep study of a not very deep neural network. Part 3a: Optimizers overview (2018)

7. Martin Riedmiller, H Braun, A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm (1993) 586-591

8. Dario Garcia-Gasulla, An Out-of-the-box Full-network Embedding for Convolutional Neural Networks (2018) 168-175

9. Neural Networks for Machine Learning, Geoffrey Hinton

10. Python keras.optimizers.RMSprop() Examples

11. RMSProp Algorithm Implementation Example