# Adam

Author: Akash Ajagekar (SYSEN 6800 Fall 2021)

## Theory

Adam maintains exponential moving averages of both the gradients (the first moment) and the squared gradients (the second moment), and uses them to adapt the learning rate of each parameter. RMSP, by contrast, adapts the learning rates using only the average of the squared gradients. The parameters β1 and β2 control the decay rates of these two moving averages. Adam is thus a combination of two gradient descent methods, Momentum and RMSP, which are explained below.

### Momentum:

This is an optimization algorithm that uses the 'exponentially weighted average' of the gradients to accelerate gradient descent. It is an extension of the gradient descent optimization algorithm.[3]

The Momentum algorithm proceeds in two steps. The first calculates the position change and the second updates the old position. The change in position is given by;

${\displaystyle update=\alpha *m_{t}}$

The new position, or weights, at time t+1 is given by;

${\displaystyle w_{t+1}=w_{t}-update}$

where;

${\displaystyle m_{t}=\beta _{1}*m_{t-1}+(1-\beta _{1})*(\partial L/\partial w_{t})}$

Here ${\displaystyle \alpha }$ (the step size) is the hyperparameter that controls the movement in the search space, also called the learning rate, and ${\displaystyle \partial L/\partial w_{t}}$ is the gradient of the loss function with respect to the weights at time t.

In the above equations ${\displaystyle m_{t}}$ and ${\displaystyle m_{t-1}}$ are the aggregates of gradients at time t and at time t-1.

Momentum has the effect of dampening down the change in the gradient and, in turn, the step size with each new point in the search space.
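The two steps above can be sketched in a few lines of Python. This is a minimal illustration for a single scalar weight, not a reference implementation: the function name `momentum_step`, the toy loss L(w) = w², and the values chosen for `alpha` and `beta1` are assumptions made for the example.

```python
def momentum_step(w, m_prev, grad, alpha, beta1):
    """One momentum update for a single scalar weight (illustrative sketch)."""
    # m_t = beta1 * m_{t-1} + (1 - beta1) * dL/dw_t
    m = beta1 * m_prev + (1 - beta1) * grad
    # update = alpha * m_t, then w_{t+1} = w_t - update
    update = alpha * m
    return w - update, m


# Assumed toy problem: L(w) = w**2, so dL/dw = 2 * w.
w, m = 5.0, 0.0
for _ in range(300):
    w, m = momentum_step(w, m, grad=2 * w, alpha=0.1, beta1=0.9)
print(w)  # ends up very close to the minimizer w = 0
```

Because the aggregate `m` is carried from one call to the next, each step blends the current gradient with the history of earlier gradients.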

### Root Mean Square Propagation (RMSP):

RMSP is an adaptive optimization algorithm that improves on AdaGrad. RMSP aims to address the problems of momentum and works well in online settings.[4] In AdaGrad we take the cumulative sum of squared gradients, whereas in RMSP we take their 'exponential average'.

It is given by,

${\displaystyle w_{t+1}=w_{t}-{\frac {\alpha _{t}}{{\sqrt {v_{t}}}+e}}*(\partial L/\partial w_{t})}$

where,

${\displaystyle v_{t}=\beta *v_{t-1}+(1-\beta )*(\partial L/\partial w_{t})^{2}}$

Here,

Aggregate of gradients at t = ${\displaystyle m_{t}}$

Aggregate of gradients at t - 1 = ${\displaystyle m_{t-1}}$

Exponential average of squared gradients at t = ${\displaystyle v_{t}}$

Weights at time t = ${\displaystyle w_{t}}$

Weights at time t + 1 = ${\displaystyle w_{t+1}}$

${\displaystyle \alpha _{t}}$ = learning rate (hyperparameter)

${\displaystyle \partial L/\partial w_{t}}$ = gradient of the loss function L with respect to the weights at time t

β = moving average (decay) parameter

${\displaystyle e}$ = small positive constant that prevents division by zero
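A minimal Python sketch of the RMSP update for a single scalar weight is shown here. The function name `rmsp_step` and the numeric values of `alpha`, `beta`, and the two test gradients are assumptions chosen only to illustrate how dividing by the square root of the running average rescales the step.

```python
import math


def rmsp_step(w, v_prev, grad, alpha, beta, e=1e-8):
    """One RMSP update for a single scalar weight (illustrative sketch)."""
    # v_t = beta * v_{t-1} + (1 - beta) * (dL/dw_t)**2
    v = beta * v_prev + (1 - beta) * grad ** 2
    # w_{t+1} = w_t - alpha / (sqrt(v_t) + e) * dL/dw_t
    w_new = w - alpha / (math.sqrt(v) + e) * grad
    return w_new, v


# Two first steps from w = 0 with gradients of very different size.
small_w, _ = rmsp_step(0.0, 0.0, grad=1.0, alpha=0.01, beta=0.9)
large_w, _ = rmsp_step(0.0, 0.0, grad=1000.0, alpha=0.01, beta=0.9)
print(small_w, large_w)  # the two steps have nearly the same size
```

Although the second gradient is a thousand times larger than the first, the normalization makes both steps almost identical in size.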

However, the two optimizers explained above have some problems, such as their generalization performance. The article [5] reports that Adam inherits the attributes of these two optimizers and builds upon them to give a better-optimized gradient descent.

## Algorithm

Taking the equations used in the above two optimizers;

${\displaystyle m_{t}=\beta _{1}*m_{t-1}+(1-\beta _{1})*(\partial L/\partial w_{t})}$

and

${\displaystyle v_{t}=\beta _{2}*v_{t-1}+(1-\beta _{2})*(\partial L/\partial w_{t})^{2}}$

Initially, both ${\displaystyle m_{t}}$ and ${\displaystyle v_{t}}$ are set to 0. Both are therefore biased towards 0, especially during the first iterations, because β1 and β2 are close to 1. The Adam optimizer corrects this problem by computing the bias-corrected estimates ${\displaystyle {\hat {m_{t}}}}$ and ${\displaystyle {\hat {v_{t}}}}$. The equations are as follows;

${\displaystyle {\hat {m_{t}}}=m_{t}\div (1-\beta _{1}^{t})}$

${\displaystyle {\hat {v_{t}}}=v_{t}\div (1-\beta _{2}^{t})}$
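To see why the correction matters, the following sketch feeds a constant gradient into the first-moment average. The values `beta1 = 0.9` and `grad = 1.0` are assumptions chosen for illustration. The raw average starts near 0 and only slowly approaches the true gradient, while the bias-corrected estimate matches it from the first iteration.

```python
beta1 = 0.9   # assumed decay rate for the illustration
grad = 1.0    # assumed constant gradient

m = 0.0
history = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * grad   # raw moving average, starts near 0
    m_hat = m / (1 - beta1 ** t)         # bias-corrected estimate
    history.append((t, m, m_hat))

for t, m_t, m_hat_t in history:
    print(t, round(m_t, 4), round(m_hat_t, 4))
```

With a constant gradient g, the raw average equals (1 - β1^t)·g, so dividing by (1 - β1^t) recovers g exactly.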

Because this correction is applied at every iteration, the moment estimates remain controlled and unbiased. Substituting the bias-corrected estimates into the weight update, we get;

${\displaystyle w_{t}=w_{t-1}-\alpha *{\frac {\hat {m_{t}}}{{\sqrt {\hat {v_{t}}}}+e}}}$

The pseudocode for the Adam optimizer is given below;

Initialize ${\displaystyle m_{0}=0}$, ${\displaystyle v_{0}=0}$, ${\displaystyle t=0}$

while ${\displaystyle w_{t}}$ not converged do

${\displaystyle t=t+1}$

${\displaystyle m_{t}=\beta _{1}*m_{t-1}+(1-\beta _{1})*(\partial L/\partial w_{t})}$

${\displaystyle v_{t}=\beta _{2}*v_{t-1}+(1-\beta _{2})*(\partial L/\partial w_{t})^{2}}$

${\displaystyle {\hat {m_{t}}}=m_{t}\div (1-\beta _{1}^{t})}$

${\displaystyle {\hat {v_{t}}}=v_{t}\div (1-\beta _{2}^{t})}$

${\displaystyle w_{t}=w_{t-1}-\alpha *{\frac {\hat {m_{t}}}{{\sqrt {\hat {v_{t}}}}+e}}}$

end

return ${\displaystyle w_{t}}$
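The pseudocode can be turned into a short, runnable Python sketch. This is one possible reading under stated assumptions rather than a definitive implementation: the function name `adam`, the stopping rule (a near-zero gradient or an iteration cap), the default hyperparameter values (commonly used ones), and the toy quadratic loss are all choices made for the example.

```python
import math


def adam(grad_fn, w0, alpha=0.001, beta1=0.9, beta2=0.999, e=1e-8,
         max_iter=10000, tol=1e-8):
    """Minimal Adam loop for a list of weights (illustrative sketch)."""
    w = list(w0)
    m = [0.0] * len(w)  # first-moment estimates, initially 0
    v = [0.0] * len(w)  # second-moment estimates, initially 0
    for t in range(1, max_iter + 1):
        g = grad_fn(w)
        # Assumed stopping rule: treat a near-zero gradient as "converged".
        if max(abs(gi) for gi in g) < tol:
            break
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi ** 2 for vi, gi in zip(v, g)]
        # Bias-corrected estimates
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        w = [wi - alpha * mh / (math.sqrt(vh) + e)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w


# Assumed toy loss: L(w) = (w[0] - 3)**2 + (w[1] + 1)**2.
def toy_grad(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


w_opt = adam(toy_grad, [0.0, 0.0], alpha=0.05)
print(w_opt)  # expected to lie close to the minimizer [3, -1]
```

The gradient is supplied as a function so that the same loop can be reused with any differentiable loss.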

## Performance

In reported comparisons, the Adam optimizer often performs better than other optimizers, in some cases by a large margin. The diagram below is one example of a performance comparison of several optimizers.

Comparison of optimizers used for training a multilayer neural network on MNIST images. Source: Google

## Numerical Example

Let's see an example of the Adam optimizer. The sample dataset below lists the weight and height of several people. We have to predict the height of a person based on the given weight.

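| Weight | 60 | 76 | 85 | 76 | 50 | 55 | 100 | 105 | 45 | 78 | 57 | 91 | 69 | 74 | 112 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Height | 76 | 72.3 | 88 | 60 | 79 | 47 | 67 | 66 | 65 | 61 | 68 | 56 | 75 | 57 | 76 |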

The hypothesis function is;

${\displaystyle f_{\theta }(x)=\theta _{0}+\theta _{1}x.}$

The cost function is;

${\displaystyle J({\theta })={\frac {1}{2}}\sum _{i=1}^{n}{\big (}f_{\theta }(x_{i})-y_{i}{\big )}^{2}}$

The optimization problem is to find the values of ${\displaystyle \theta }$ that minimize the objective function below. It differs from ${\displaystyle J(\theta )}$ only by a positive constant factor, so both have the same minimizer;

${\displaystyle \mathrm {argmin} _{\theta }\quad {\frac {1}{n}}\sum _{i=1}^{n}{\big (}f_{\theta }(x_{i})-y_{i}{\big )}^{2}}$

The gradients of the cost function with respect to the weights ${\displaystyle \theta _{0}}$ and ${\displaystyle \theta _{1}}$, for a single sample ${\displaystyle (x,y)}$, are;

${\displaystyle {\frac {\partial J(\theta )}{\partial \theta _{0}}}={\big (}f_{\theta }(x)-y{\big )}}$
${\displaystyle {\frac {\partial J(\theta )}{\partial \theta _{1}}}={\big (}f_{\theta }(x)-y{\big )}x}$
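The hypothesis, cost function, and single-sample gradients can be written as small Python helpers. The function names `predict`, `cost`, and `sample_gradient` are hypothetical, chosen for this sketch. The call at the end uses θ = [10, 1] and the first data sample (60, 76) from the table.

```python
def predict(theta, x):
    """Hypothesis f_theta(x) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x


def cost(theta, xs, ys):
    """J(theta) = 1/2 * sum of squared prediction errors."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


def sample_gradient(theta, x, y):
    """Gradient of the single-sample cost with respect to [theta_0, theta_1]."""
    error = predict(theta, x) - y
    return [error, error * x]


print(sample_gradient([10.0, 1.0], 60.0, 76.0))  # [-6.0, -360.0]
```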

The initial values of ${\displaystyle {\theta }}$ are set to [10, 1], the learning rate ${\displaystyle \alpha }$ is set to 0.01, and the parameters ${\displaystyle \beta _{1}}$, ${\displaystyle \beta _{2}}$, and ${\displaystyle e}$ are set to 0.94, 0.9878 and ${\displaystyle 10^{-8}}$ respectively.

Iteration 1:

Starting from the first data sample (weight 60, height 76), the gradients are;

${\displaystyle {\frac {\partial J(\theta )}{\partial \theta _{0}}}={\big (}10+1\cdot 60-76{\big )}=-6}$
${\displaystyle {\frac {\partial J(\theta )}{\partial \theta _{1}}}={\big (}10+1\cdot 60-76{\big )}\cdot 60=-360}$

Since ${\displaystyle m_{0}}$ and ${\displaystyle v_{0}}$ are initially zero, ${\displaystyle m_{1}}$ and ${\displaystyle v_{1}}$ are calculated as;

${\displaystyle m_{1}=0.94\cdot 0+(1-0.94)\cdot {\begin{bmatrix}-6\\-360\end{bmatrix}}={\begin{bmatrix}-0.36\\-21.6\end{bmatrix}}}$
${\displaystyle v_{1}=0.9878\cdot 0+(1-0.9878)\cdot {\begin{bmatrix}(-6)^{2}\\(-360)^{2}\end{bmatrix}}={\begin{bmatrix}0.4392\\1581.12\end{bmatrix}}}$

The new bias-corrected values of ${\displaystyle m_{1}}$ and ${\displaystyle v_{1}}$ are;

${\displaystyle {\hat {m}}_{1}={\begin{bmatrix}-0.36\\-21.6\end{bmatrix}}{\frac {1}{(1-0.94^{1})}}={\begin{bmatrix}-6\\-360\end{bmatrix}}}$
${\displaystyle {\hat {v}}_{1}={\begin{bmatrix}0.4392\\1581.12\end{bmatrix}}{\frac {1}{(1-0.9878^{1})}}={\begin{bmatrix}36\\129600\end{bmatrix}}}$

Finally, the weight update is;

${\displaystyle \theta _{0}=10-0.01\cdot (-6)/({\sqrt {36}}+10^{-8})\approx 10.01}$
${\displaystyle \theta _{1}=1-0.01\cdot (-360)/({\sqrt {129600}}+10^{-8})\approx 1.01}$
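The arithmetic of this first iteration can be checked with a few lines of Python. The sketch uses only the values stated above (initial θ, α, β1, β2, e, and the first data sample). The variable names are chosen for the example.

```python
import math

theta = [10.0, 1.0]                     # initial parameters
alpha, beta1, beta2, e = 0.01, 0.94, 0.9878, 1e-8
x, y = 60.0, 76.0                       # first data sample
t = 1

error = theta[0] + theta[1] * x - y
grad = [error, error * x]               # [-6, -360]

m = [(1 - beta1) * g for g in grad]     # m_0 = 0, so only the new term remains
v = [(1 - beta2) * g ** 2 for g in grad]

m_hat = [mi / (1 - beta1 ** t) for mi in m]
v_hat = [vi / (1 - beta2 ** t) for vi in v]

theta = [th - alpha * mh / (math.sqrt(vh) + e)
         for th, mh, vh in zip(theta, m_hat, v_hat)]
print([round(th, 4) for th in theta])   # [10.01, 1.01]
```

At the first iteration the bias correction exactly undoes the factors (1 - β1) and (1 - β2), so each parameter moves by almost exactly α regardless of the gradient's magnitude.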

The procedure is repeated until the parameters converge, giving values for ${\displaystyle \theta }$ of [11.39, 2].

## Conclusion

Research has shown that Adam demonstrates superior experimental performance over other optimizers such as AdaGrad, SGD, and RMSP in deep neural networks (DNNs).[8] This type of optimizer is useful for large datasets. As discussed above, it combines the Momentum and RMSP optimization algorithms. The method is straightforward, easy to use, and requires little memory. We have also shown an example in which several optimizers are compared, with the results presented as a graph. Overall, Adam is a robust optimizer and well suited for non-convex optimization problems in the fields of Machine Learning and Deep Learning.[9]

## References

1. A. Agnes Lydia and F. Sagayaraj Francis, Adagrad - An Optimizer for Stochastic Gradient Descent, Department of Computer Science and Engineering, Pondicherry Engineering College, May 2019.
2. Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: neural networks for machine learning, 4(2):26–31, 2012.
3. John Pomerat, Aviv Segev, and Rituparna Datta, On Neural Network Activation Functions and Optimizers in Relation to Polynomial Regression, 2019 IEEE International Conference on Big Data (Big Data).
4. Zijun Zhang, Improved Adam Optimizer for Deep Neural Networks, ©2018 IEEE.
5. Wendyam Eric Lionel Ilboudo, Taisuke Kobayashi, Kenji Sugimoto, TAdam: A Robust Stochastic Gradient Optimizer, [cs.LG] 3 Mar 2020.
6. Diederik P. Kingma, Jimmy Lei Ba, Adam: A Method For Stochastic Optimization, Published as a conference paper at ICLR 2015.
7. AATILA Mustapha, LACHGAR Mohamed and KARTIT Ali, Comparative study of optimization techniques in deep learning: Application in the ophthalmology field, The International Conference on Mathematics & Data Science (ICMDS) 2020.
8. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research.
9. Ameer Hamza Khan, Xinwei Cao, Shuai Li, Vasilios N. Katsikis, and Liefa Liao, Bas-Adam: An Adam Based Approach to Improve the Performance of Beetle Antennae Search Optimizer, IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 7, NO. 2, MARCH 2020.