Adam

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Authors: Nicholas Kincaid (ChemE 6800 Fall 2020), Akash Ajagekar (SYSEN 6800 Fall 2021)

== Introduction ==
Adam <ref name="adam"> Kingma, Diederik P., and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015, pp. 1–15.</ref> is a variant of gradient descent that has become widely popular in the machine learning community. Presented in 2015, the Adam algorithm is often recommended as the default algorithm for training neural networks, as it has shown improved performance over other variants of gradient descent for a wide range of problems. Adam's name is derived from adaptive moment estimation because it uses estimates of the first and second moments of the gradient to perform updates, which can be seen as combining gradient descent with momentum (the first moment) and the [https://optimization.cbe.cornell.edu/index.php?title=RMSProp RMSProp] algorithm<ref>Tieleman, Tijmen, and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 2012.</ref> (the second moment).
Adam can also be viewed as an extension of stochastic gradient descent (SGD) and has been applied across deep learning, for example in computer vision and natural language processing. It was first presented at [https://www.iclr.cc/archive/www/doku.php%3Fid=iclr2015:main.html ICLR 2015], a major deep learning conference. The optimizer adapts the learning rate for each weight of the neural network using estimates of the first and second moments of the gradient; the name is derived from adaptive moment estimation and is not an acronym. Adam requires only first-order gradients and has a small memory requirement. Before Adam, adaptive optimization techniques such as AdaGrad and RMSP were introduced; they often perform better than SGD, but in some cases their generalization performance is worse than that of SGD. Adam was proposed to improve on these methods, and its hyperparameters have intuitive interpretations and typically require little tuning.<ref>A. Agnes Lydia and F. Sagayaraj Francis, ''Adagrad - An Optimizer for Stochastic Gradient Descent,'' Department of Computer Science and Engineering, Pondicherry Engineering College, May 2019.</ref> Nevertheless, researchers have observed cases in which Adam does not converge to the optimal solution while SGD does, and on a diverse set of deep learning tasks Adam can show weaker generalization. According to Nitish Shirish Keskar and Richard Socher, switching from Adam to SGD during training can in some cases give better generalization than using Adam alone.<ref>Nitish Shirish Keskar and Richard Socher, ''Improving Generalization Performance by Switching from Adam to SGD'', 2017.</ref>
 
== Theory ==
Whereas RMSP adapts the learning rate using only an exponential moving average of the squared gradients (the second moment), Adam also maintains an exponential moving average of the gradients themselves (the first moment). The parameters <math>\beta_1</math> and <math>\beta_2</math> control the decay rates of these two moving averages. Adam can therefore be seen as a combination of two gradient descent methods, momentum and RMSP, which are explained below.
 
=== Momentum: ===
Momentum is an extension of the gradient descent optimization algorithm that uses an exponentially weighted average of past gradients to accelerate the descent.<ref>John Pomerat, Aviv Segev, and Rituparna Datta, ''On Neural Network Activation Functions and Optimizers in Relation to Polynomial Regression'', 2019 IEEE International Conference on Big Data (Big Data).</ref>
 
The momentum update has two parts: the first computes an exponentially weighted average of the gradients, and the second updates the weights. The weight change is

<math>update=\alpha\cdot m_t</math>

and the new weights at time t + 1 are

<math>w_{t+1}=w_t-\alpha\cdot m_t</math>

where the moving average of the gradients is

<math>m_t = \beta_1\cdot m_{t-1} + (1-\beta_1)\cdot \frac{\partial L}{\partial w_t}</math>

Here <math>\alpha</math> (the step size) is a hyperparameter that controls the movement in the search space and is also called the learning rate, and <math>\partial L/\partial w_t</math> is the gradient of the loss with respect to the weights at time t. In the equation above, <math>m_t</math> and <math>m_{t-1}</math> are the aggregates of the gradients (moving averages) at times t and t − 1, respectively.

Momentum has the effect of damping the change in the gradient and, in turn, the step size at each new point in the search space.
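
The following is a minimal Python (NumPy) sketch of the momentum update above. The quadratic loss, its gradient, and all variable names are illustrative assumptions, not part of the original article.

<syntaxhighlight lang="python">
import numpy as np

def momentum_update(w, m_prev, grad, alpha=0.01, beta1=0.9):
    """One momentum step: m_t = beta1*m_{t-1} + (1-beta1)*grad, w_{t+1} = w_t - alpha*m_t."""
    m = beta1 * m_prev + (1 - beta1) * grad
    w_next = w - alpha * m
    return w_next, m

# Illustrative use on an assumed quadratic loss L(w) = w^2, whose gradient is 2w.
w, m = np.array([5.0]), np.zeros(1)
for t in range(500):
    w, m = momentum_update(w, m, grad=2 * w)
print(w)  # w approaches the minimizer 0
</syntaxhighlight>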
 
=== Root Mean Square Propagation (RMSP): ===
RMSP is an adaptive optimization algorithm that improves on AdaGrad.<ref>Zijun Zhang, ''Improved Adam Optimizer for Deep Neural Networks'', 2018 IEEE.</ref> Whereas AdaGrad accumulates the sum of squared gradients, RMSP uses an exponential moving average of the squared gradients, which keeps the effective learning rate from shrinking too aggressively and works well in online and non-stationary settings.
 
The update rule is

<math>w_{t+1}=w_t-\frac{\alpha_t}{\sqrt{v_t}+\epsilon}\cdot \frac{\partial L}{\partial w_t}</math>

where

<math>v_t = \beta\cdot v_{t-1} + (1-\beta)\cdot \left(\frac{\partial L}{\partial w_t}\right)^2</math>

Here,

<math>v_t</math> = exponential average of squared gradients at time t

<math>v_{t-1}</math> = exponential average of squared gradients at time t - 1

<math>w_t</math> = weights at time t

<math>w_{t+1}</math> = weights at time t + 1

<math>\alpha_t</math> = learning rate (hyperparameter)

<math>\partial L/\partial w_t</math> = gradient of the loss with respect to the weights at time t

<math>\beta</math> = averaging (decay) parameter

<math>\epsilon</math> = small constant for numerical stability
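
A minimal NumPy sketch of the RMSP update in the same style; the quadratic test loss and variable names are assumptions for illustration.

<syntaxhighlight lang="python">
import numpy as np

def rmsprop_update(w, v_prev, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSP step: v_t = beta*v_{t-1} + (1-beta)*grad^2, w_{t+1} = w_t - alpha*grad/(sqrt(v_t) + eps)."""
    v = beta * v_prev + (1 - beta) * grad**2
    w_next = w - alpha * grad / (np.sqrt(v) + eps)
    return w_next, v

# Illustrative use on an assumed quadratic loss L(w) = w^2 (gradient 2w).
w, v = np.array([5.0]), np.zeros(1)
for t in range(1000):
    w, v = rmsprop_update(w, v, grad=2 * w)
print(w)  # w ends near 0 (within roughly one step size alpha)
</syntaxhighlight>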
 
Each of the two optimizers above has shortcomings of its own, for example in generalization performance. Adam takes over the attributes of both and builds upon them to give a better-optimized gradient descent.<ref>Wendyam Eric Lionel Ilboudo, Taisuke Kobayashi, Kenji Sugimoto, ''TAdam: A Robust Stochastic Gradient Optimizer'', arXiv [cs.LG], 3 Mar 2020.</ref>
 
== Algorithm ==
Adam combines the moving-average equations of the two optimizers above,

<math>m_t = \beta_1\cdot m_{t-1} + (1-\beta_1)\cdot \frac{\partial L}{\partial w_t}</math>

and

<math>v_t = \beta_2\cdot v_{t-1} + (1-\beta_2)\cdot \left(\frac{\partial L}{\partial w_t}\right)^2</math>

Initially, both <math>m_t</math> and <math>v_t</math> are set to 0. Because <math>\beta_1</math> and <math>\beta_2</math> are close to 1, both estimates are biased towards 0 during the early iterations. The Adam optimizer corrects this by computing bias-corrected estimates <math>\hat{m}_t</math> and <math>\hat{v}_t</math>:

<math>\hat{m}_t=m_t/(1-\beta_1^t )</math>

<math>\hat{v}_t=v_t/(1-\beta_2^t )</math>

Substituting the bias-corrected estimates into the weight update gives

<math>w_t=w_{t-1}-\alpha\cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}</math>
 
The pseudocode for the Adam optimizer is given below:

'''while''' ''<math>w_t</math> not converged'' '''do'''

<math>t = t + 1</math>

<math>m_t = \beta_1\cdot m_{t-1} + (1-\beta_1)\cdot \frac{\partial L}{\partial w_t}</math>

<math>v_t = \beta_2\cdot v_{t-1} + (1-\beta_2)\cdot \left(\frac{\partial L}{\partial w_t}\right)^2</math>

<math>\hat{m}_t=m_t/(1-\beta_1^t )</math>

<math>\hat{v}_t=v_t/(1-\beta_2^t )</math>

<math>w_t=w_{t-1}-\alpha\cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}</math>

'''end while'''

'''return''' ''<math>w_t</math>''
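
The pseudocode translates directly into code. Below is a minimal NumPy sketch of the Adam loop; the example loss, function names, and step count are assumptions made for illustration, not part of the original pseudocode.

<syntaxhighlight lang="python">
import numpy as np

def adam(grad_fn, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=10000):
    """Minimize a function given its gradient grad_fn, following the pseudocode above."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first-moment estimate, m_0 = 0
    v = np.zeros_like(w)   # second-moment estimate, v_0 = 0
    for t in range(1, n_steps + 1):
        g = grad_fn(w)                       # dL/dw_t
        m = beta1 * m + (1 - beta1) * g      # biased first-moment update
        v = beta2 * v + (1 - beta2) * g**2   # biased second-moment update
        m_hat = m / (1 - beta1**t)           # bias correction
        v_hat = v / (1 - beta2**t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Illustrative use on an assumed loss L(w) = (w - 3)^2, whose gradient is 2*(w - 3).
print(adam(lambda w: 2 * (w - 3), w0=[0.0]))  # approaches [3.]
</syntaxhighlight>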
 
== Performance ==
Adam generally converges faster than the other optimizers discussed above, often by a wide margin, and yields a better-optimized result. The figure to the right shows one example of such a comparison, training a multilayer neural network on MNIST images, taken from the original Adam paper.

[[File:Adam training.png|thumb|Comparison of optimizers used for training a multilayer neural network on MNIST images (Kingma & Ba, 2015)<ref name="adam" />.]]
 
 
 


== Background ==
=== Batch Gradient Descent ===
In standard batch gradient descent, the parameters, <math>\theta</math>, of the objective function <math>f(\theta)</math>, are updated based on the gradient of <math>f</math> with respect to
<math>\theta</math> for the entire training dataset, as
<math> g_t =\nabla_{\theta_{t-1}} f \big(\theta_{t-1} \big) </math> <br/>
<math> \theta_t = \theta_{t-1} - \alpha g_t , </math> <br/>


where <math>\alpha</math> is defined as the learning rate and is a hyper-parameter of the optimization algorithm, and <math>t</math> is the iteration number. Key challenges of the standard gradient descent method are the tendency to get stuck in local minima and/or saddle points of the objective function, as well as choosing a proper learning rate, <math>\alpha</math>, which can lead to poor convergence.<ref>Ruder, Sebastian. An Overview of Gradient Descent Optimization Algorithms, 2016, pp. 1–14, http://arxiv.org/abs/1609.04747.</ref>


=== Stochastic Gradient Descent ===
Another variant of gradient descent is [https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent stochastic gradient descent (SGD)], in which the gradient is computed and the parameters are updated as above, but using a single training sample at a time.
=== Mini-Batch Gradient Descent ===
In between batch gradient descent and stochastic gradient descent, mini-batch gradient descent computes parameter updates from the gradient of a subset of the training set, where the size of the subset is often referred to as the batch size. A sketch contrasting the three variants is given below.
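
The three variants differ only in how much data is used per update. The following is a minimal sketch of a mini-batch gradient descent loop for least-squares regression; the data, function names, and hyperparameters are illustrative assumptions. A batch size of 1 recovers SGD, and a batch size equal to the full data set recovers batch gradient descent.

<syntaxhighlight lang="python">
import numpy as np

def minibatch_gd(X, y, theta, alpha=0.01, batch_size=4, epochs=100):
    """Mini-batch gradient descent for least squares: theta_t = theta_{t-1} - alpha * g_t."""
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            residual = X[batch] @ theta - y[batch]
            g = X[batch].T @ residual / len(batch)   # gradient on the mini-batch
            theta = theta - alpha * g
    return theta

# Example: fit y = 2 + 3x with an intercept column (assumed synthetic data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
print(minibatch_gd(X, y, theta=np.zeros(2), alpha=0.1, batch_size=4, epochs=200))
</syntaxhighlight>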


== Adam Algorithm ==
The Adam algorithm first computes the gradient, <math>g_t</math>, of the objective function with respect to the parameters <math>\theta</math>, and then computes and stores exponential moving averages of the first and second moments of the gradient, <math>m_t</math> and <math>v_t</math>
respectively, as


<math> m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t </math> <br/>
<math> v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2, </math> <br/>


where <math>\beta_1</math> and <math>\beta_2</math> are hyper-parameters in the interval <math>[0,1]</math>. These parameters can be seen as exponential decay rates of the estimated moments, as the previous value is successively multiplied by a value less than one in each iteration. The authors of the original paper suggest the values <math>\beta_1 = 0.9</math> and <math>\beta_2 = 0.999</math>. In the current notation, the first iteration of the algorithm is at <math>t=1</math>, and both <math>m_0</math> and <math>v_0</math> are initialized to zero. Since both moments are initialized to zero, their estimates are biased towards zero at early time steps. To counter this, the authors proposed a corrected update to <math>m_t</math> and <math>v_t</math> as


<math> \hat{m}_t = m_t / (1-\beta_1 ^t) </math> <br/>
<math> \hat{v}_t = v_t / (1-\beta_2 ^t). </math> <br/>
Finally, the parameter update is computed as


<math> \theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon), </math> <br/>


where <math>\epsilon</math> is a small constant for stability. The authors recommend a value of <math>\epsilon=10^{-8}</math>.
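
To see how quickly the bias correction fades, the factors <math>1/(1-\beta^t)</math> can be evaluated for the suggested decay rates. This is a small illustrative calculation, not taken from the original paper.

<syntaxhighlight lang="python">
# Bias-correction factors 1/(1 - beta^t) for the suggested beta_1 = 0.9 and beta_2 = 0.999.
for t in (1, 10, 100, 1000):
    c1 = 1 / (1 - 0.9**t)
    c2 = 1 / (1 - 0.999**t)
    print(t, round(c1, 3), round(c2, 3))
# At t = 1 the factors are 10 and 1000; both approach 1 as t grows,
# so the correction mainly matters during the first iterations.
</syntaxhighlight>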


== Numerical Example ==


[[File:Contour.png|thumb|Contour plot of the loss function showing the trajectory of the Adam algorithm from the initial point]]


[[File:Model fit .png|thumb|Plot showing the original data points and the resulting model fit from the Adam algorithm]]


To illustrate how updates occur in the Adam algorithm, consider a linear least-squares regression problem. The table below shows a sample data set of student exam grades and the number of hours spent studying for the exam. The goal of this example is to generate a linear model that predicts exam grade as a function of time spent studying.


{| class="wikitable"
|-
| Hours Studying || 9.0 || 4.9 || 1.6 || 1.9 || 7.9 || 2.0 || 11.5 || 3.9 || 1.1 || 1.6 || 5.1 || 8.2 || 7.3 || 10.4 || 11.2
|-
| Exam Grade || 88.0 || 72.3 || 66.5 || 65.1 || 79.5 || 60.8 || 94.3 || 66.7 || 65.4 || 63.8 || 68.4 || 82.5 || 75.9 || 87.8 || 85.2
|}


The hypothesized model function will be


<math>f_\theta(x) = \theta_0 + \theta_1 x.</math>


The cost function is defined as


<math> J({\theta}) =  \frac{1}{2}\sum_i^n \big(f_\theta(x_i) - y_i \big)^2, </math>


where the <math>1/2</math> coefficient is used only to make the derivatives cleaner. The optimization problem can then be formulated as finding the values of <math>\theta</math> that minimize the squared residuals between <math>f_\theta(x)</math> and <math>y</math>:


<math> \mathrm{argmin}_{\theta} \quad \frac{1}{n}\sum_{i}^n \big(f_\theta(x_i) - y_i \big) ^2 </math>


For simplicity, the parameters will be updated after every data point, i.e. with a batch size of 1. For a single data point the derivatives of the cost function with respect to <math>\theta_0</math> and <math>\theta_1</math> are


<math> \frac{\partial J(\theta)}{\partial \theta_0} = \big(f_\theta(x) - y \big)  </math><br/>
<math> \frac{\partial J(\theta)}{\partial \theta_1} = \big(f_\theta(x) - y \big) x </math>


The initial values of <math>{\theta}</math> are set to [50, 1], the learning rate <math>\alpha</math> is set to 0.1, and the suggested parameters for <math>\beta_1</math>, <math>\beta_2</math>, and <math>\epsilon</math> are used (0.9, 0.999, and <math>10^{-8}</math>). With the first data sample of <math> (x,y)=(9.0, 88.0)</math>, the computed gradients are


<math> \frac{\partial J(\theta)}{\partial \theta_0} = \big(50 + 1\cdot 9 - 88.0 \big) = -29.0  </math><br/>
<math> \frac{\partial J(\theta)}{\partial \theta_1} = \big(50 + 1\cdot 9 - 88.0 \big)\cdot 9.0 = -261  </math><br/>


With <math>m_0</math> and <math>v_0</math> initialized to zero, the calculations of <math>m_1</math> and <math>v_1</math> are


<math> m_1 = 0.9 \cdot 0 + (1-0.9) \cdot \begin{bmatrix} -29\\ -261 \end{bmatrix} = \begin{bmatrix} -2.9\\ -26.1\end{bmatrix} </math> <br/>
<math> v_1 = 0.999\cdot 0 + (1-0.999) \cdot \begin{bmatrix} (-29)^2\\(-261)^2 \end{bmatrix} = \begin{bmatrix} 0.84\\ 68.2\end{bmatrix} </math> <br/>


The bias-corrected terms are computed as


<math> \hat{m}_1 = \begin{bmatrix} -2.9\\ -26.1\end{bmatrix} \frac{1}{ (1-0.9^1)} =  \begin{bmatrix} -29.0\\-261.0\end{bmatrix}</math> <br/>
<math> \hat{v}_1 = \begin{bmatrix} 0.84\\ 68.2\end{bmatrix}  \frac{1} {(1-0.999^1)} = \begin{bmatrix} 851.5\\68168\end{bmatrix}. </math> <br/>


Finally, the parameter update is


<math> \theta_0 = 50 - 0.1 \cdot (-29) / (\sqrt{851.5} + 10^{-8}) = 50.1 </math> <br/>
<math> \theta_1 = 1 - 0.1 \cdot (-261) / (\sqrt{68168} + 10^{-8}) = 1.1 </math> <br/>


This procedure is repeated until the parameters have converged, giving <math>\theta</math> values of <math>[58.98, 2.72]</math>. The figures to the right show the trajectory of the Adam algorithm over a contour plot of the objective function and the resulting model fit. It should be noted that stochastic gradient descent with a learning rate of 0.1 diverges on this problem, and with a rate of 0.01 it oscillates around the global minimum because of the large magnitude of the gradient in the <math>\theta_1</math> direction.
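
The first iteration of this example can be reproduced with a short script. This is a minimal sketch following the update equations above with the stated hyperparameters; printed values may differ slightly from the rounded numbers in the worked example.

<syntaxhighlight lang="python">
import numpy as np

# First Adam iteration for the exam-grade example (theta = [50, 1], alpha = 0.1).
x, y = 9.0, 88.0
theta = np.array([50.0, 1.0])
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

residual = theta[0] + theta[1] * x - y          # f_theta(x) - y
g = np.array([residual, residual * x])          # gradients w.r.t. theta_0 and theta_1

m = (1 - beta1) * g                             # m_1, with m_0 = 0
v = (1 - beta2) * g**2                          # v_1, with v_0 = 0
m_hat = m / (1 - beta1**1)                      # bias-corrected estimates at t = 1
v_hat = v / (1 - beta2**1)
theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print("gradients:", g)                          # [-29., -261.]
print("m1, v1:", m, v)                          # [-2.9, -26.1], [0.841, 68.121]
print("m1_hat, v1_hat:", m_hat, v_hat)
print("updated theta:", theta)                  # approximately [50.1, 1.1]
</syntaxhighlight>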
As a second example, consider the sample data set below, which lists the weights and heights of several people. The goal is to predict a person's height from their weight, using the same linear model, cost function, and derivatives as above.


{| class="wikitable"
|-
| Weight || 60 || 76 || 85 || 76 || 50 || 55 || 100 || 105 || 45 || 78 || 57 || 91 || 69 || 74 || 112
|-
| Height || 76 || 72.3 || 88 || 60 || 79 || 47 || 67 || 66 || 65 || 61 || 68 || 56 || 75 || 57 || 76
|}


The initial values of <math>{\theta}</math> are set to [10, 1], the learning rate <math>\alpha</math> is set to 0.01, and the parameters <math>\beta_1</math>, <math>\beta_2</math>, and <math>\epsilon</math> are set to 0.94, 0.9878, and <math>10^{-8}</math>, respectively.


'''Iteration 1:'''


Starting from the first data sample, the gradients are


<math> \frac{\partial J(\theta)}{\partial \theta_0} = \big(10 + 1\cdot 60 - 76 \big) = -6  </math><br />
<math> \frac{\partial J(\theta)}{\partial \theta_1} = \big(10 + 1\cdot 60 - 76 \big)\cdot 60 = -360  </math>


With <math>m_0</math> and <math>v_0</math> initialized to zero, <math>m_1</math> and <math>v_1</math> are calculated as


<math> m_1 = 0.94 \cdot 0 + (1-0.94) \cdot \begin{bmatrix} -6\\ -360 \end{bmatrix} = \begin{bmatrix} -0.36\\ -21.6\end{bmatrix} </math> <br />
<math> v_1 = 0.9878\cdot 0 + (1-0.9878) \cdot \begin{bmatrix} (-6)^2\\(-360)^2 \end{bmatrix} = \begin{bmatrix} 0.4392\\ 1581.12\end{bmatrix}  </math> <br />


The bias-corrected values of <math>m_1</math> and <math>v_1</math> are


<math> \hat{m}_1 = \begin{bmatrix} -0.36\\ -21.6\end{bmatrix} \frac{1}{ (1-0.94^1)} =  \begin{bmatrix} -6\\-360\end{bmatrix}</math> <br />
<math> \hat{v}_1 = \begin{bmatrix} 0.4392\\ 1581.12\end{bmatrix}  \frac{1} {(1-0.9878^1)} = \begin{bmatrix} 36\\129600\end{bmatrix} </math> <br />


Finally, the weight update is


<math> \theta_0 = 10 - 0.01 \cdot (-6) / (\sqrt{36} + 10^{-8}) = 10.01 </math> <br/>
<math> \theta_1 = 1 - 0.01 \cdot (-360) / (\sqrt{129600} + 10^{-8}) = 1.01 </math> <br/>


The procedure is repeated until the parameters converge, giving <math> \theta </math> values of [11.39, 2].


== Applications ==
The Adam optimization algorithm has been widely used in machine learning applications to train model parameters. When used with backpropagation, the Adam algorithm has been shown to be a very robust and efficient method for training artificial neural networks and is capable of working well with a variety of structures and applications. In their original paper, the authors present three different training examples: logistic regression, a multi-layer neural network for classification of MNIST images, and a convolutional neural network (CNN). The training results from the original Adam paper, showing the objective function cost versus iterations over the entire data set for the multi-layer neural network, are shown in the Performance section above.

The Adam optimization algorithm has also become a common replacement for SGD when training deep neural networks. According to John Pomerat, Aviv Segev, and Rituparna Datta, Adam combines the best properties of the AdaGrad and RMSP algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.<ref>Diederik P. Kingma, Jimmy Lei Ba, ''Adam: A Method For Stochastic Optimization'', Published as a conference paper at ICLR 2015.</ref> Research has shown that Adam demonstrates strong experimental performance relative to other optimizers such as AdaGrad, SGD, and RMSP. Further research is ongoing on adaptive optimizers for Federated Learning, a privacy-preserving alternative to centralized machine learning in which training is done on the device itself without sharing data with a cloud server. In a study by Aatila Mustapha, Lachgar Mohamed, and Kartit Ali, different optimizers were compared in order to select one suitable for large data sets.<ref>AATILA Mustapha, LACHGAR Mohamed and KARTIT Ali, ''Comparative study of optimization techniques in deep learning: Application in the ophthalmology field,'' The International Conference on Mathematics & Data Science (ICMDS) 2020.</ref> In that study AdaGrad performed best, while Adam also showed promising results, indicating that Adam does not outperform the alternatives in every case.
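
In practice, Adam is usually invoked through a deep learning framework rather than implemented by hand. The following is a hedged illustration of such usage with PyTorch; the toy model, data, and hyperparameters are placeholders chosen for this sketch, not taken from the original article.

<syntaxhighlight lang="python">
import torch

# Toy linear model and a few data points (placeholders for illustration only).
model = torch.nn.Linear(1, 1)
x = torch.tensor([[9.0], [4.9], [1.6]])
y = torch.tensor([[88.0], [72.3], [66.5]])

optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999), eps=1e-8)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()          # clear accumulated gradients
    loss = loss_fn(model(x), y)    # forward pass and loss
    loss.backward()                # backpropagate
    optimizer.step()               # Adam parameter update
</syntaxhighlight>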


== Variants of Adam ==
=== AdaMax ===
AdaMax<ref name="adam" /> is a variant of the Adam algorithm proposed in the original Adam paper that uses an exponentially weighted infinity norm instead of the second-order moment estimate. The weighted infinity norm update <math>u_t</math> is computed as


<math> u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|). </math>


The parameter update then becomes


<math> \theta_t = \theta_{t-1} - (\alpha / (1-\beta_1^t)) \cdot m_t / u_t. </math>
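
A minimal NumPy sketch of the AdaMax update described above; the quadratic test function, default hyperparameters, and step count are assumptions made for illustration.

<syntaxhighlight lang="python">
import numpy as np

def adamax(grad_fn, w0, alpha=0.002, beta1=0.9, beta2=0.999, n_steps=5000):
    """AdaMax: replaces Adam's second-moment estimate with a weighted infinity norm u_t."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    u = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g            # first-moment estimate, as in Adam
        u = np.maximum(beta2 * u, np.abs(g))       # u_t = max(beta2 * u_{t-1}, |g_t|)
        w = w - (alpha / (1 - beta1**t)) * m / u   # theta_t = theta_{t-1} - (alpha/(1-beta1^t)) * m_t / u_t
    return w

# Illustrative use on an assumed loss L(w) = (w - 3)^2.
print(adamax(lambda w: 2 * (w - 3), w0=[0.0]))  # approaches [3.]
</syntaxhighlight>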
 
=== Nadam ===
The Nadam algorithm<ref>Dozat, Timothy. Incorporating Nesterov Momentum into Adam. ICLR Workshop, no. 1, 2016, pp. 2013–16.</ref> was proposed in 2016 and incorporates the Nesterov Accelerated Gradient (NAG)<ref>Nesterov, Yuri. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, 1983, pp. 372-376.</ref>, a popular momentum-like variant of SGD, into the first-order moment term.


== Conclusion ==
Adam is a variant of the gradient descent algorithm that has been widely adopted in the machine learning community. Adam can be seen as the combination of two other variants of gradient descent, SGD with momentum and RMSProp. Adam uses estimations of the first and second-order moments of the gradient to adapt the parameter update. These moment estimations are computed via moving averages, <math>m_t</math> and <math>v_t</math>, of the gradient and the squared gradient respectively. In a variety of neural network training applications, Adam has shown increased convergence and robustness over other gradient descent algorithms and is often recommended as the default optimizer for training.<ref> "Neural Networks Part 3: Learning and Evaluation," CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University, 2020.</ref>

Research has also shown that Adam demonstrates strong experimental performance compared with other optimizers such as AdaGrad, SGD, and RMSP when training deep neural networks.<ref>Duchi, J., Hazan, E., & Singer, Y. (2011). ''Adaptive subgradient methods for online learning and stochastic optimization''. Journal of Machine Learning Research.</ref> The optimizer is well suited to large data sets, is straightforward to use, and requires relatively little memory. A comparison of Adam with other optimizers on an example training task is shown in the Performance section above. Overall, Adam is a robust optimizer and well suited for the non-convex optimization problems that arise in machine learning and deep learning.<ref>Ameer Hamza Khan, Xinwei Cao, Shuai Li, Vasilios N. Katsikis, and Liefa Liao, ''BAS-ADAM: An ADAM Based Approach to Improve the Performance of Beetle Antennae Search Optimizer'', IEEE/CAA Journal of Automatica Sinica, Vol. 7, No. 2, March 2020.</ref>


== References ==
<references/>
