RMSProp - Revision history

Wc593 at 11:05, 21 December 2020

2020-12-21T11:05:24Z

← Older revision		Revision as of 07:05, 21 December 2020
Line 2:		Line 2:

	== Introduction ==		== Introduction ==
	RMSProp, root mean square propagation, is an optimization algorithm/method designed for Artificial Neural Network (ANN) training. And it is an unpublished algorithm first proposed in the Coursera course [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning”] lecture six by Geoff Hinton<sup>[9]</sup>. RMSProp lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years because it is the extension of Stochastic Gradient Descent (SGD) algorithm, momentum method, and the foundation of Adam algorithm. One of the applications of RMSProp is the stochastic technology for mini-batch gradient descent.		RMSProp, root mean square propagation, is an optimization algorithm/method designed for Artificial Neural Network (ANN) training. And it is an unpublished algorithm first proposed in the Coursera course. [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning”] lecture six by Geoff Hinton.<sup>[9]</sup> RMSProp lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years because it is the extension of Stochastic Gradient Descent (SGD) algorithm, momentum method, and the foundation of Adam algorithm. One of the applications of RMSProp is the stochastic technology for mini-batch gradient descent.

	==Theory and Methodology==		==Theory and Methodology==

Jason Huang: /* Numerical Example */

2020-12-14T04:48:22Z

Numerical Example

← Older revision		Revision as of 00:48, 14 December 2020
Line 61:		Line 61:
	<math>E_{2}(t) = 0.9 E_{2}(t-1) + (1 - 0.9)(\frac{\partial c_{2}}{\partial w_{2}})^2</math>		<math>E_{2}(t) = 0.9 E_{2}(t-1) + (1 - 0.9)(\frac{\partial c_{2}}{\partial w_{2}})^2</math>

	<math>w_{1}(t) = w_{1}(t-1) - \frac{- 0.4}{ \sqrt{E_{1}}} \frac{\partial c_{1}}{\partial w_{1}}</math>		<math>w_{1}(t) = w_{1}(t-1) - \frac{0.4}{ \sqrt{E_{1}}} \frac{\partial c_{1}}{\partial w_{1}}</math>

	<math>w_{2}(t) = w_{2}(t-1) - \frac{- 0.4}{ \sqrt{E_{1}}} \frac{\partial c_{2}}{\partial w_{2}}</math>		<math>w_{2}(t) = w_{2}(t-1) - \frac{0.4}{ \sqrt{E_{1}}} \frac{\partial c_{2}}{\partial w_{2}}</math>

	while using programming language to help us to visualize the trajectory of RMSProp algorithm, we can observe that the curve converge to a certain point. For this particular question, minimize solution <math>0 </math> will be obtained ~~where~~ <math>(x_{1}, x_{2}) </math> is <math>(0, 0) </math>. [[File:1 - 2dKCQHh - Long Valley.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]		while using programming language to help us to solve optimization problem and visualize the trajectory of RMSProp algorithm, we can observe that the curve converge to a certain point. For this particular question, minimize solution <math>0 </math> will be obtained with <math>(x_{1}, x_{2}) </math> is <math>(0, 0) </math>. [[File:1 - 2dKCQHh - Long Valley.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]
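The update equations in this numerical example can be sketched in code. This is a minimal illustration, not the article's own implementation: the decay factor 0.9 and learning rate 0.4 come from the equations above, while the objective `w1**2 + w2**2`, the starting point passed to `run`, the small constant `eps`, and the use of a separate running average per weight are assumptions made for the sketch.

```python
import math


def rmsprop_step(w, grad, avg_sq, lr=0.4, decay=0.9, eps=1e-8):
    """One RMSProp update for a single weight.

    avg_sq is the running average E(t-1) of the squared gradient.
    eps is a small constant (an assumption here) that avoids division by zero.
    """
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    w = w - lr / (math.sqrt(avg_sq) + eps) * grad
    return w, avg_sq


def run(w1, w2, steps=100):
    """Apply RMSProp to the hypothetical objective w1**2 + w2**2."""
    e1 = e2 = 0.0
    for _ in range(steps):
        g1, g2 = 2.0 * w1, 2.0 * w2  # partial derivatives of the assumed objective
        w1, e1 = rmsprop_step(w1, g1, e1)
        w2, e2 = rmsprop_step(w2, g2, e2)
    return w1, w2
```

On the very first step the running average holds only the current squared gradient, so the step length is `lr / sqrt(1 - decay)` whatever the gradient magnitude. With a fixed learning rate as large as 0.4 the iterates can keep oscillating around the minimum rather than settling exactly on it, so a smaller or decaying rate is a common choice in practice.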

	== Applications and Discussion ==		== Applications and Discussion ==

Jason Huang at 18:29, 13 December 2020

2020-12-13T18:29:21Z

Jason Huang at 18:25, 13 December 2020

2020-12-13T18:25:02Z

← Older revision		Revision as of 14:25, 13 December 2020
Line 80:		Line 80:
	==Reference==		==Reference==

	1. A. Radford, "[https://imgur.com/a/Hqolp#NKsFHJb Visualizing Optimization Algos (open sourse)"]		1. A. Radford, "[https://imgur.com/a/Hqolp#NKsFHJb Visualizing Optimization Algos (open sourse)".]

	2. R. Yamashita, M Nishio and R KGian, "Convolutional neural networks: an overview and application in radiology", pp. 9:611–629, 2018[[File:3 - NKsFHJb - Saddle Point.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>1</sup>]]3. V. Bushave, "Understanding RMSprop — faster neural network learning", 2018.		2. R. Yamashita, M Nishio and R KGian, "Convolutional neural networks: an overview and application in radiology", pp. 9:611–629, 2018.[[File:3 - NKsFHJb - Saddle Point.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>1</sup>]]3. V. Bushave, "Understanding RMSprop — faster neural network learning", 2018.

	4. V. Bushave, "How do we ‘train’ neural networks ?", 2017.		4. V. Bushave, "How do we ‘train’ neural networks ?", 2017.
Line 94:		Line 94:
	8. D. Garcia-Gasulla, "An Out-of-the-box Full-network Embedding for Convolutional Neural Networks" pp.168-175, 2018.		8. D. Garcia-Gasulla, "An Out-of-the-box Full-network Embedding for Convolutional Neural Networks" pp.168-175, 2018.

	9. [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Geoffrey Hinton, "Coursera Neural Networks for Machine Learning lecture 6", 2018]		9. [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Geoffrey Hinton, "Coursera Neural Networks for Machine Learning lecture 6", 2018.]

	10. [https://www.programcreek.com/python/example/104283/keras.optimizers.RMSprop Python keras.optimizers.RMSprop() Examples]		10. [https://www.programcreek.com/python/example/104283/keras.optimizers.RMSprop Python keras.optimizers.RMSprop() Examples.]

	11. [https://d2l.ai/chapter_optimization/rmsprop.html RMSProp Algorithm Implementation Example]		11. [https://d2l.ai/chapter_optimization/rmsprop.html RMSProp Algorithm Implementation Example.]

	12. S.De, A. Mukherjee, and E. Ullah, "Convergence guarantees for RMSProp and Adam in non-convex optimization and and empirical comparison to Nesterov acceleration", conference paper at ICLR, 2019		12. S.De, A. Mukherjee, and E. Ullah, "Convergence guarantees for RMSProp and Adam in non-convex optimization and and empirical comparison to Nesterov acceleration", conference paper at ICLR, 2019.

Jason Huang at 18:23, 13 December 2020

2020-12-13T18:23:06Z


Jason Huang at 16:19, 12 December 2020

2020-12-12T16:19:21Z


Jason Huang at 10:00, 12 December 2020

2020-12-12T10:00:45Z

← Older revision		Revision as of 06:00, 12 December 2020
Line 28:		Line 28:
	<math>w_{ij}(t+1) = w_{ij}(t) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)</math>		<math>w_{ij}(t+1) = w_{ij}(t) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)</math>

	~~Obviously, the~~ choice of the learning rate <math>\epsilon</math>, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate will possibly lead to oscillation, preventing the error to fall below a certain value<sup>7</sup>.		The choice of the learning rate <math>\epsilon</math>, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate will possibly lead to oscillation, preventing the error to fall below a certain value<sup>7</sup>.

	In addition, RProp can be combined with the momentum method to mitigate the above problem and accelerate convergence; the update can then be rewritten as:		In addition, RProp can be combined with the momentum method to mitigate the above problem and accelerate convergence; the update can then be rewritten as:


	~~<math> \Delta w_{ij}(t) = \epsilon \frac{\partial E}{\partial w_{ij}}(t) + \Delta w_{ij}(t-1) </math>~~

			<math> \Delta w_{ij}(t) = \epsilon \frac{\partial E}{\partial w_{ij}}(t) + \mu \Delta w_{ij}(t-1) </math>

	However, It turns out that the optimal value of the momentum parameter <math>\mu</math> is equally problem dependent as the learning rate <math>\epsilon</math>, and that no general improvement can be accomplished. Besides, RProp algorithm is not function well when we have very large datasets and need to perform mini-batch weights updates.

			However, It turns out that the optimal value of the momentum parameter <math>\mu</math> in above equation is equally problem dependent as the learning rate <math>\epsilon</math>, and that no general improvement can be accomplished. Besides, RProp algorithm is not function well when we have very large datasets and need to perform mini-batch weights updates. Therefore, scientist proposal a novel algorithm, RMSProp, which can cover more scenarios than RProp.

	=== '''RMSProp''' ===		=== '''RMSProp''' ===
	RProp algorithm ~~doesn’t~~ work for mini-batches is that it violates the central idea behind stochastic gradient descent, which is when we have a small enough learning rate, it averages the gradients over successive mini-batches. To solve this issue, consider the weight, that gets the gradient 0.1 on nine mini-batches, and the gradient of -0.9 on tenths mini-batch, RMSProp did force those gradients to roughly cancel each other out, so that the stay approximately the same.		RProp algorithm does not work for mini-batches is that it violates the central idea behind stochastic gradient descent, which is when we have a small enough learning rate, it averages the gradients over successive mini-batches. To solve this issue, consider the weight, that gets the gradient 0.1 on nine mini-batches, and the gradient of -0.9 on tenths mini-batch, RMSProp did force those gradients to roughly cancel each other out, so that the stay approximately the same.

	RMSProp combines the use of the gradient sign from the RProp algorithm with the efficiency of mini-batches, averaging over mini-batches so that gradients are combined in the right way. RMSProp keeps a moving average of the squared gradient for each weight, and the gradient is then divided by the square root of this mean square.		RMSProp combines the use of the gradient sign from the RProp algorithm with the efficiency of mini-batches, averaging over mini-batches so that gradients are combined in the right way. RMSProp keeps a moving average of the squared gradient for each weight, and the gradient is then divided by the square root of this mean square.
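The mini-batch example above (a gradient of 0.1 on nine mini-batches and -0.9 on the tenth) can be checked numerically. The sketch below compares plain gradient steps with sign-only steps of fixed size. The step size 0.01 and the function names are assumptions for illustration, and the sign-only rule is a simplification of RProp that omits its per-weight step-size adaptation.

```python
def net_change_gradient(grads, lr=0.01):
    """Total weight change when every step is -lr * gradient (SGD-style)."""
    return sum(-lr * g for g in grads)


def net_change_sign_only(grads, step=0.01):
    """Total weight change when every step uses only the gradient's sign."""
    total = 0.0
    for g in grads:
        if g > 0:
            total -= step
        elif g < 0:
            total += step
    return total


# Nine mini-batches with gradient 0.1, then one with gradient -0.9.
grads = [0.1] * 9 + [-0.9]
sgd_change = net_change_gradient(grads)    # the ten steps roughly cancel
sign_change = net_change_sign_only(grads)  # nine steps down, one step up
```

The gradient-proportional steps cancel, so the weight stays approximately where it was, while the sign-only steps move the weight by eight step sizes in one direction. This is the behaviour that motivates dividing by a moving average instead of using the sign alone.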

Jason Huang at 09:12, 12 December 2020

2020-12-12T09:12:39Z

← Older revision		Revision as of 05:12, 12 December 2020
Line 41:		Line 41:
	RProp algorithm doesn’t work for mini-batches is that it violates the central idea behind stochastic gradient descent, which is when we have a small enough learning rate, it averages the gradients over successive mini-batches. To solve this issue, consider the weight, that gets the gradient 0.1 on nine mini-batches, and the gradient of -0.9 on tenths mini-batch, RMSProp did force those gradients to roughly cancel each other out, so that the stay approximately the same.		RProp algorithm doesn’t work for mini-batches is that it violates the central idea behind stochastic gradient descent, which is when we have a small enough learning rate, it averages the gradients over successive mini-batches. To solve this issue, consider the weight, that gets the gradient 0.1 on nine mini-batches, and the gradient of -0.9 on tenths mini-batch, RMSProp did force those gradients to roughly cancel each other out, so that the stay approximately the same.

	By using the sign of gradient from RProp algorithm, and the mini-batches efficiency, and averaging over mini-batches which allows combining gradients in the right way. ~~PMSProp~~ is keeping the moving average of the squared gradients for each weight. And then we divide the gradient by square root the mean square.		By using the sign of gradient from RProp algorithm, and the mini-batches efficiency, and averaging over mini-batches which allows combining gradients in the right way. RMSProp is keeping the moving average of the squared gradients for each weight. And then we divide the gradient by square root the mean square.

	The update equation can be written as:		The update equation can be written as:

Jason Huang at 09:01, 12 December 2020

2020-12-12T09:01:24Z

← Older revision		Revision as of 05:01, 12 December 2020
Line 6:		Line 6:
	==Theory and Methodology==		==Theory and Methodology==

	=== Perceptron ===		=== Perceptron and Neural Networks ===
	Perceptron is an algorithm used for supervised learning of binary classifier, and also can be ~~treated~~ as the simplify version/single layer of the Artificial Neural Network ~~can be regarded as~~ the human brain and conscious center ~~of Aritifical~~ Intelligence(AI), presenting the imitation of what the mind will be when human thinking. ~~Scientists are trying to build the concept~~ of ~~ANN close real neurons with their biological ‘parent’.~~		Perceptron is an algorithm used for supervised learning of binary classifier, and also can be regard as the simplify version/single layer of the Artificial Neural Network to better understanding the neural network, which is to perform the human brain and conscious center in Artificial Intelligence(AI) and presenting the imitation of what the mind will like when human thinking. The basis form of the perceptron consists inputs, weights, bias, net sum and activation function.
	~~[[File:Neuron.png\|thumb\|A single neuron presented as a mathematic function ]]And~~ the function ~~of neurons can be presented as:~~


	<math>~~f (~~x_{1},x_{2}~~) = max(0,~~ w_{1} ~~x_{1} +~~ w_{2} x_{2}) </math>		The process of the perceptron is start by initiating input value <math>x_{1},x_{2} </math> and multiplying them by their weights to obtain <math>w_{1}, w_{2} </math>. All of the weights will be added up together to create the weight sum<math> \sum_i w_{i} </math>. And the weighted sum is then applied to the activation function <math>f </math> to produce the perceptron's output.
			[[File:Neuron.png\|thumb\|A single neuron presented as a mathematic function ]]


	Where <math>x_{1},x_{2} </math> are two inputs numbers, and function <math>f (x_{1},x_{2}) </math> will takes these fixed inputs and create an output of single number. If <math>w_{1} x_{1} + w_{2} x_{2} </math> is greater than 0, the ~~function will return this positive value, or return 0 otherwise~~. ~~Therefore, the~~ neural network ~~can be replaced as~~ a ~~coupled~~ mathematical function, and ~~its output~~ of ~~a previous function~~ can be ~~used~~ as the ~~next~~ function ~~input~~.		A neural network works similarly to the human brain’s neural network. A “neuron” in a neural network is a mathematical function that collects and classifies information according to a specific architecture. A neural network contains layers of interconnected nodes, which can be regards as the perception and is similar to the multiple linear regression. The perceptron transfers the signal by a multiple linear regression into an activation function which may be nonlinear.
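The perceptron described in this revision (inputs, weights, bias, net sum, activation function) can be sketched as follows. This is only an illustration: the step activation and the example numbers are assumptions, since the text does not fix a particular activation function.

```python
def perceptron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus a bias, passed through a step activation."""
    net_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step activation (assumed here): output 1 if the net sum is positive, else 0.
    return 1 if net_sum > 0 else 0


# Two inputs x1, x2 with weights w1, w2 (hypothetical values).
fires = perceptron_output([1.0, 2.0], [0.5, -0.1])
silent = perceptron_output([1.0, 2.0], [-0.5, 0.1], bias=0.1)
```

Replacing the step function with a smooth nonlinearity gives the kind of neuron that is stacked into the layers of a neural network.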

	=== '''RProp''' ===		=== '''RProp''' ===

Jason Huang at 05:02, 12 December 2020

2020-12-12T05:02:54Z

← Older revision		Revision as of 01:02, 12 December 2020
Line 2:		Line 2:

	== Introduction ==		== Introduction ==
	RMSProp, so call root mean square propagation, is an optimization algorithm/method ~~dealing with~~ Artificial Neural Network (ANN) ~~for machine learning. It is also a currently developed algorithm compared to the Stochastic Gradient Descent (SGD) algorithm, momentum method~~. And ~~even one of the foundations of Adam algorithm development.~~		RMSProp, so call root mean square propagation, is an optimization algorithm/method designed for Artificial Neural Network (ANN) training. And it is an unpublished algorithm first proposed in the Coursera course [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning”] lecture six by Geoff Hinton. RMSProp lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years because it is the extension of Stochastic Gradient Descent (SGD) algorithm, momentum method, and the foundation of Adam algorithm. One of the application of RMSProp is the stochastic technology for mini-batch gradient descent.
	It is an unpublished ~~optimization~~ algorithm~~, using the adaptive learning rate method,~~ first proposed in the Coursera course [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning” lecture six] by Geoff Hinton. ~~Astonished~~ is ~~that this informally revealed~~, ~~an unpublished~~ algorithm is ~~intensely famous nowadays~~.

	==Theory and Methodology==		==Theory and Methodology==

	=== ~~'''Artificial Neural Network'''~~ ===		=== Perceptron ===
	Artificial Neural Network can be regarded as the human brain and conscious center of Aritifical Intelligence(AI), presenting the imitation of what the mind will be when human thinking. Scientists are trying to build the concept of ANN close real neurons with their biological ‘parent’.		Perceptron is an algorithm used for supervised learning of binary classifier, and also can be treated as the simplify version/single layer of the Artificial Neural Network can be regarded as the human brain and conscious center of Aritifical Intelligence(AI), presenting the imitation of what the mind will be when human thinking. Scientists are trying to build the concept of ANN close real neurons with their biological ‘parent’.
	[[File:Neuron.png\|thumb\|A single neuron presented as a mathematic function ]]And the function of neurons can be presented as:		[[File:Neuron.png\|thumb\|A single neuron presented as a mathematic function ]]And the function of neurons can be presented as:

← Older revision		Revision as of 14:29, 13 December 2020
Line 2:		Line 2:

	== Introduction ==		== Introduction ==
	RMSProp, root mean square propagation, is an optimization algorithm/method designed for Artificial Neural Network (ANN) training. And it is an unpublished algorithm first proposed in the Coursera course [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning”] lecture six by Geoff Hinton. RMSProp lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years because it is the extension of Stochastic Gradient Descent (SGD) algorithm, momentum method, and the foundation of Adam algorithm. One of the applications of RMSProp is the stochastic technology for mini-batch gradient descent.		RMSProp, root mean square propagation, is an optimization algorithm/method designed for Artificial Neural Network (ANN) training. And it is an unpublished algorithm first proposed in the Coursera course [https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf “Neural Network for Machine Learning”] lecture six by Geoff Hinton<sup>[9]</sup>. RMSProp lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years because it is the extension of Stochastic Gradient Descent (SGD) algorithm, momentum method, and the foundation of Adam algorithm. One of the applications of RMSProp is the stochastic technology for mini-batch gradient descent.

	==Theory and Methodology==		==Theory and Methodology==
Line 25:		Line 25:
	<math>w_{ij}(t+1) = w_{ij}(t) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)</math>		<math>w_{ij}(t+1) = w_{ij}(t) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)</math>

	The choice of the learning rate <math>\epsilon</math>, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate will possibly lead to oscillation, preventing the error to fall below a certain value<sup>7</sup>.		The choice of the learning rate <math>\epsilon</math>, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate will possibly lead to oscillation, preventing the error to fall below a certain value<sup>[7]</sup>.

	In addition, RProp can be combined with the momentum method to mitigate the above problem and accelerate convergence; the update can then be rewritten as:		In addition, RProp can be combined with the momentum method to mitigate the above problem and accelerate convergence; the update can then be rewritten as:
Line 65:		Line 65:
	<math>w_{2}(t) = w_{2}(t-1) - \frac{- 0.4}{ \sqrt{E_{1}}} \frac{\partial c_{2}}{\partial w_{2}}</math>		<math>w_{2}(t) = w_{2}(t-1) - \frac{- 0.4}{ \sqrt{E_{1}}} \frac{\partial c_{2}}{\partial w_{2}}</math>

	while using programming language to help us to visualize the trajectory of RMSProp algorithm, we can observe that the curve converge to a certain point. For this particular question, minimize solution <math>0 </math> will be obtained where <math>(x_{1}, x_{2}) </math> is <math>(0, 0) </math>. [[File:1 - 2dKCQHh - Long Valley.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>1</sup>]]		while using programming language to help us to visualize the trajectory of RMSProp algorithm, we can observe that the curve converge to a certain point. For this particular question, minimize solution <math>0 </math> will be obtained where <math>(x_{1}, x_{2}) </math> is <math>(0, 0) </math>. [[File:1 - 2dKCQHh - Long Valley.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]

	== Applications and Discussion ==		== Applications and Discussion ==
	[[File:2 - pD0hWu5 - Beale's function.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>1</sup>]]		[[File:2 - pD0hWu5 - Beale's function.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]
	The applications of RMSprop concentrate on optimizing complex functions such as neural networks, or non-convex optimization problems with an adaptive learning rate, and it is widely used in stochastic problems. The RMSprop optimizer restricts oscillations in the vertical direction. Therefore, we can increase the learning rate, and the algorithm can take larger steps in the horizontal direction, converging faster than the similar approach of gradient descent combined with the momentum method.		The applications of RMSprop concentrate on optimizing complex functions such as neural networks, or non-convex optimization problems with an adaptive learning rate, and it is widely used in stochastic problems. The RMSprop optimizer restricts oscillations in the vertical direction. Therefore, we can increase the learning rate, and the algorithm can take larger steps in the horizontal direction, converging faster than the similar approach of gradient descent combined with the momentum method.

	In the first visualization, the gradient-based optimization algorithms show different convergence rates. As the visualizations show, algorithms that do not scale their steps using gradient information have difficulty breaking the symmetry and converging rapidly. RMSProp converges faster than SGD, Momentum, and NAG, beginning its descent sooner, but it is slower than Ada-grad and Ada-delta, which are also adaptive learning rate algorithms. In conclusion, when handling problems with large scales or gradients, methods that scale the gradients/step sizes, such as Ada-delta, Ada-grad, and RMSProp, perform better and with high stability.		In the first visualization, the gradient-based optimization algorithms show different convergence rates. As the visualizations show, algorithms that do not scale their steps using gradient information have difficulty breaking the symmetry and converging rapidly. RMSProp converges faster than SGD, Momentum, and NAG, beginning its descent sooner, but it is slower than Ada-grad and Ada-delta, which are also adaptive learning rate algorithms. In conclusion, when handling problems with large scales or gradients, methods that scale the gradients/step sizes, such as Ada-delta, Ada-grad, and RMSProp, perform better and with high stability.

	Ada-grad is an adaptive learning rate algorithm that looks a lot like RMSProp. Ada-grad adds element-wise scaling of the gradient based on the historical sum of squares in each dimension. This means that we keep a running sum of squared gradients and then adapt the learning rate by dividing it by the square root of that sum. Since the concepts in RMSProp are widely used in other machine learning algorithms, it has high potential to be coupled with other methods such as momentum.		Ada-grad is an adaptive learning rate algorithm that looks a lot like RMSProp. Ada-grad adds element-wise scaling of the gradient based on the historical sum of squares in each dimension. This means that we keep a running sum of squared gradients and then adapt the learning rate by dividing it by the square root of that sum. Since the concepts in RMSProp are widely used in other machine learning algorithms, it has high potential to be coupled with other methods such as momentum.
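The difference between Ada-grad's running sum and RMSProp's moving average of squared gradients can be illustrated with a short sketch. It assumes the usual formulations in which the learning rate is divided by the square root of the accumulator. The base learning rate 0.1, the decay 0.9, the constant `eps`, and the constant gradient stream are illustrative choices, not values from the article.

```python
import math


def effective_learning_rates(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Return (adagrad_rate, rmsprop_rate) after each gradient in grads."""
    running_sum = 0.0  # Ada-grad: sum of all squared gradients so far
    moving_avg = 0.0   # RMSProp: exponentially decaying average of squared gradients
    rates = []
    for g in grads:
        running_sum += g * g
        moving_avg = decay * moving_avg + (1.0 - decay) * g * g
        rates.append((lr / (math.sqrt(running_sum) + eps),
                      lr / (math.sqrt(moving_avg) + eps)))
    return rates


# A constant gradient of 1.0 for 100 steps (hypothetical input).
rates = effective_learning_rates([1.0] * 100)
adagrad_last, rmsprop_last = rates[-1]
```

Under a constant gradient the Ada-grad rate keeps shrinking as the sum grows, whereas the RMSProp rate levels off near the base learning rate because old squared gradients are gradually forgotten.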

	== Conclusion==		== Conclusion==
Line 82:		Line 82:
	1. A. Radford, "[https://imgur.com/a/Hqolp#NKsFHJb Visualizing Optimization Algos (open sourse)".]		1. A. Radford, "[https://imgur.com/a/Hqolp#NKsFHJb Visualizing Optimization Algos (open sourse)".]

	2. R. Yamashita, M Nishio and R KGian, "Convolutional neural networks: an overview and application in radiology", pp. 9:611–629, 2018.[[File:3 - NKsFHJb - Saddle Point.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>1</sup>]]3. V. Bushave, "Understanding RMSprop — faster neural network learning", 2018.		2. R. Yamashita, M Nishio and R KGian, "Convolutional neural networks: an overview and application in radiology", pp. 9:611–629, 2018.[[File:3 - NKsFHJb - Saddle Point.gif\|thumb\|Visualizing Optimization algorithm comparing convergence with similar algorithm<sup>[1]</sup>]]3. V. Bushave, "Understanding RMSprop — faster neural network learning", 2018.

	4. V. Bushave, "How do we ‘train’ neural networks ?", 2017.		4. V. Bushave, "How do we ‘train’ neural networks ?", 2017.