Adam: Difference between revisions

Revision as of 17:14, 29 November 2021

Author: Akash Ajagekar (SYSEN 6800 Fall 2021)

Introduction

The Adam optimizer is an extension of stochastic gradient descent (SGD) with broad scope for deep learning applications in computer vision and natural language processing. It is an optimization algorithm that can be used as an alternative to the classical SGD process. The name is derived from adaptive moment estimation. Adam was proposed as an efficient stochastic optimization method that requires only first-order gradients and has a small memory requirement.^[1] Before Adam, several adaptive optimization techniques such as AdaGrad and RMSP were introduced. These often perform well compared with SGD, but in some cases they have disadvantages, such as generalization performance that is worse than that of SGD. Adam was introduced to improve on these methods, including in terms of generalization performance.

Theory

Adam is a combination of two gradient descent methods, which are explained below.

Momentum:

This is an optimization algorithm that accelerates gradient descent by taking the exponentially weighted average of the gradients into consideration. It is an extension of the gradient descent optimization algorithm.

The momentum algorithm works in two steps. The first is to calculate the change in position, and the second is to update the old position with that change. The change in position is given by,

$\mathrm{update} = \alpha \cdot m_{t}$

The new position, or weights, at time t+1 is given by,

$w_{t+1} = w_{t} - \mathrm{update}$

In the above equations, α (the step size) is the hyperparameter that controls the movement in the search space; it is also called the learning rate. The term $m_{t}$ is the aggregate of gradients at time t,

$m_{t} = \beta \cdot m_{t-1} + (1-\beta) \cdot {\frac{\partial L}{\partial w_{t}}}$

where $m_{t}$ and $m_{t-1}$ are the aggregates of gradients at time t and at time t-1, and β is the moving average parameter.

According to ^[2], momentum has the effect of dampening the change in the gradient and, in turn, the step size with each new point in the search space.
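The two momentum steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `momentum_step`, the default values, and the quadratic test loss are hypothetical and do not come from the article.

```python
def momentum_step(w, m, grad, alpha=0.01, beta=0.9):
    """One momentum update; returns the new weight and the new aggregate."""
    m_new = beta * m + (1.0 - beta) * grad  # m_t = beta * m_{t-1} + (1 - beta) * dL/dw_t
    update = alpha * m_new                  # change in position
    return w - update, m_new                # w_{t+1} = w_t - update


# Hypothetical loss L(w) = w**2, whose gradient is 2 * w (assumption for the demo).
w, m = 5.0, 0.0
for _ in range(200):
    w, m = momentum_step(w, m, grad=2.0 * w, alpha=0.1, beta=0.9)
print(w)  # close to the minimiser w = 0
```

Keeping the aggregate `m` outside the function makes the state explicit: the caller carries it from one iteration to the next.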

Root Mean Square Propagation (RMSP):

RMSP is an adaptive optimization algorithm that is an improved version of AdaGrad. AdaGrad takes the cumulative sum of squared gradients, whereas RMSP takes their exponential average.

It is given by,

$w_{t+1} = w_{t} - {\frac{\alpha_{t}}{\sqrt{v_{t}+\epsilon}}} \cdot {\frac{\partial L}{\partial w_{t}}}$

where,

$v_{t} = \beta \cdot v_{t-1} + (1-\beta) \cdot \left({\frac{\partial L}{\partial w_{t}}}\right)^{2}$

Here,

Aggregate of gradients at time t = $m_{t}$

Aggregate of gradients at time t-1 = $m_{t-1}$

Exponential average of squared gradients at time t = $v_{t}$

Weights at time t = $w_{t}$

Weights at time t+1 = $w_{t+1}$

$\alpha_{t}$ = learning rate (hyperparameter)

$\partial L / \partial w_{t}$ = derivative of the loss function with respect to the weights at time t

β = moving average parameter

ε = small positive constant
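The RMSP update above can be sketched as follows. This is a minimal illustration, not a reference implementation: the name `rmsp_step` and the default values are assumptions, and the small constant from the formula is written `eps`.

```python
import math


def rmsp_step(w, v, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSP update; returns the new weight and the new squared-gradient average."""
    v_new = beta * v + (1.0 - beta) * grad ** 2        # exponential average of squared gradients
    w_new = w - alpha / math.sqrt(v_new + eps) * grad  # step scaled by 1 / sqrt(v_t + eps)
    return w_new, v_new


# Hypothetical single step with gradient 2.0 (values chosen only for the demo).
w, v = rmsp_step(1.0, 0.0, grad=2.0, alpha=0.1)
print(w, v)
```

Because the step is divided by the root of the running average, directions with persistently large gradients take smaller effective steps.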

However, the two optimizers explained above have some problems, such as generalization performance. The article ^[3] explains that Adam inherits the attributes of these two optimizers and builds upon them to give a more optimized gradient descent.

Algorithm

Taking the equations used in the above two optimizers,

$m_{t} = \beta_{1} \cdot m_{t-1} + (1-\beta_{1}) \cdot {\frac{\partial L}{\partial w_{t}}}$ and $v_{t} = \beta_{2} \cdot v_{t-1} + (1-\beta_{2}) \cdot \left({\frac{\partial L}{\partial w_{t}}}\right)^{2}$

Initially, both $m_{t}$ and $v_{t}$ are set to 0. Both tend to be biased towards 0, because β1 and β2 are close to 1. The Adam optimizer corrects this problem by computing bias-corrected values of $m_{t}$ and $v_{t}$. The equations are as follows,

${\hat{m}}_{t} = {\frac{m_{t}}{1-\beta_{1}^{t}}}$

${\hat{v}}_{t} = {\frac{v_{t}}{1-\beta_{2}^{t}}}$

Because the correction is applied at every iteration, the estimates remain controlled and unbiased. Substituting the bias-corrected estimates in place of the original ones in the update rule, we get,

$w_{t+1} = w_{t} - {\hat{m}}_{t} \cdot {\frac{\alpha}{{\sqrt{{\hat{v}}_{t}}}+\epsilon}}$
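Putting the moving averages, the bias correction, and the weight update together, one Adam iteration might look like the sketch below. The function name `adam_step` and the list-based interface are assumptions made for illustration; the default values are the ones commonly suggested for Adam and are not prescribed by this section.

```python
import math


def adam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a list of weights; t is the 1-based iteration count."""
    # Moving averages of the gradient and of the squared gradient.
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction: undoes the pull towards the zero initialisation.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Weight update with the bias-corrected estimates.
    w = [wi - alpha * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


# Hypothetical loss L(w) = w0**2 + w1**2 with gradient [2*w0, 2*w1] (demo assumption).
w, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    grad = [2.0 * wi for wi in w]
    w, m, v = adam_step(w, m, v, grad, t, alpha=0.01)
print(w)
```

At t = 1 the bias-corrected estimates reduce to the gradient and its square, so the first step has magnitude close to α in every coordinate regardless of the gradient's scale.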

Performance

The Adam optimizer often gives better results than the other optimizers and, in some comparisons, outperforms them by a large margin. The diagram below is one example of a performance comparison of the optimizers.

Numerical Example

Let's see an example of the Adam optimizer. The sample dataset below lists the number of hours spent studying and the exam grade for a group of students. The goal is to predict the exam grade from the hours spent studying.

Hours Studying	60	76	85	76	50	55	100	105	45	78	57	91	69	74	112
Exam Grade	76	72.3	88	60	79	47	67	66	65	61	68	56	75	57	76

The hypothesis function is,

$f_{\theta }(x)=\theta _{0}+\theta _{1}x.$

The cost function is,

$J({\theta })={\frac {1}{2}}\sum _{i}^{n}{\big (}f_{\theta }(x_{i})-y_{i}{\big )}^{2},$

The optimization problem is to find the values of $\theta$ that minimize the objective function below,

$\mathrm {argmin} _{\theta }\quad {\frac {1}{n}}\sum _{i}^{n}{\big (}f_{\theta }(x_{i})-y_{i}{\big )}^{2}$

For a single data point, the derivatives of the cost function with respect to the weights $\theta _{0}$ and $\theta _{1}$ are,

${\frac {\partial J(\theta )}{\partial \theta _{0}}}={\big (}f_{\theta }(x)-y{\big )}$
${\frac {\partial J(\theta )}{\partial \theta _{1}}}={\big (}f_{\theta }(x)-y{\big )}x$
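As a sanity check on these derivative expressions, the analytic gradient for one data point can be compared with a central finite difference of the cost. The helper names and the test values below are hypothetical and chosen only for the check.

```python
def cost(theta, x, y):
    """Single-sample cost 0.5 * (f_theta(x) - y)**2 for the linear model."""
    return 0.5 * (theta[0] + theta[1] * x - y) ** 2


def gradient(theta, x, y):
    """Analytic derivatives with respect to theta_0 and theta_1."""
    residual = theta[0] + theta[1] * x - y
    return [residual, residual * x]


def numerical_gradient(theta, x, y, h=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    grads = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        grads.append((cost(up, x, y) - cost(down, x, y)) / (2.0 * h))
    return grads


theta, x, y = [0.5, 2.0], 3.0, 4.0  # hypothetical values for the check
print(gradient(theta, x, y))        # [2.5, 7.5]
print(numerical_gradient(theta, x, y))
```

The factor of 1/2 in the cost cancels the 2 produced by differentiating the square, which is why the derivatives carry no extra constant.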

The initial values of ${\theta }$ are set to [10, 1], the learning rate $\alpha$ is set to 0.01, and the parameters $\beta _{1}$ , $\beta _{2}$ , and $\epsilon$ are set to 0.94, 0.9878 and $10^{-8}$ respectively. Starting from the first data sample, $(x,y)=(60,76)$ , the gradients are,

${\frac {\partial J(\theta )}{\partial \theta _{0}}}={\big (}10+1\cdot 60-76{\big )}=-6$
${\frac {\partial J(\theta )}{\partial \theta _{1}}}={\big (}10+1\cdot 60-76{\big )}\cdot 60=-360$

Since $m_{0}$ and $v_{0}$ are initialized to zero, $m_{1}$ and $v_{1}$ are calculated as

$m_{1}=0.94\cdot 0+(1-0.94)\cdot {\begin{bmatrix}-6\\-360\end{bmatrix}}={\begin{bmatrix}-0.36\\-21.6\end{bmatrix}}$
$v_{1}=0.9878\cdot 0+(1-0.9878)\cdot {\begin{bmatrix}(-6)^{2}\\(-360)^{2}\end{bmatrix}}={\begin{bmatrix}0.4392\\1581.12\end{bmatrix}},$

The new bias-corrected values of $m_{1}$ and $v_{1}$ are,

${\hat {m}}_{1}={\begin{bmatrix}-0.36\\-21.6\end{bmatrix}}{\frac {1}{(1-0.94^{1})}}={\begin{bmatrix}-6\\-360\end{bmatrix}}$
${\hat {v}}_{1}={\begin{bmatrix}0.4392\\1581.12\end{bmatrix}}{\frac {1}{(1-0.9878^{1})}}={\begin{bmatrix}36\\129600\end{bmatrix}}.$

Finally, the weight update is,

$\theta _{0}=10-0.01\cdot (-6)/({\sqrt {36}}+10^{-8})=10.01$
$\theta _{1}=1-0.01\cdot (-360)/({\sqrt {129600}}+10^{-8})=1.01$

The procedure is repeated until the values of the weights have converged.
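The hand calculation above can be reproduced with a short script. This is a sketch under the example's own settings (θ = [10, 1], α = 0.01, β1 = 0.94, β2 = 0.9878, constant 10^-8), updating after every data point; the variable names and the single pass over the data are assumptions made for illustration.

```python
import math

hours = [60, 76, 85, 76, 50, 55, 100, 105, 45, 78, 57, 91, 69, 74, 112]
grades = [76, 72.3, 88, 60, 79, 47, 67, 66, 65, 61, 68, 56, 75, 57, 76]

alpha, beta1, beta2, eps = 0.01, 0.94, 0.9878, 1e-8
theta = [10.0, 1.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
history = []  # weights after each update

for t, (x, y) in enumerate(zip(hours, grades), start=1):
    residual = theta[0] + theta[1] * x - y
    grad = [residual, residual * x]  # derivatives for this one data point
    for j in range(2):
        m[j] = beta1 * m[j] + (1.0 - beta1) * grad[j]
        v[j] = beta2 * v[j] + (1.0 - beta2) * grad[j] ** 2
        m_hat = m[j] / (1.0 - beta1 ** t)
        v_hat = v[j] / (1.0 - beta2 ** t)
        theta[j] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    history.append(list(theta))

print([round(p, 4) for p in history[0]])  # first update: [10.01, 1.01]
```

In practice the loop would be repeated over the data set many times, stopping once the change in `theta` between passes falls below a chosen tolerance.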

Applications

The Adam optimization algorithm is widely used as a replacement for SGD when training deep neural networks (DNNs). Adam combines the best properties of the AdaGrad and RMSP algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. Adam is often reported to perform well compared with other optimizers such as AdaGrad, SGD and RMSP. Research is ongoing into adaptive optimizers for federated learning, and their performance is being compared. Federated learning is a privacy-preserving approach to machine learning in which training is done on the device itself, without sharing the data with a cloud server.

Variants of Adam

AdaMax

AdaMax^[4] is a variant of the Adam algorithm, proposed in the original Adam paper, that uses an exponentially weighted infinity norm instead of the second-order moment estimate. The weighted infinity norm, $u_{t}$ , is updated as

$u_{t}=\max(\beta _{2}\cdot u_{t-1},|g_{t}|).$

The parameter update then becomes

$\theta _{t}=\theta _{t-1}-(\alpha /(1-\beta _{1}^{t}))\cdot m_{t}/u_{t}.$
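The two AdaMax formulas above translate into code almost line by line. The sketch below assumes that $m_{t}$ is the same first-moment average used by Adam; the function name `adamax_step` and the default values are illustrative assumptions.

```python
def adamax_step(theta, m, u, grad, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update on a list of parameters; t is the 1-based iteration count."""
    # First-moment average, assumed to be the same as in Adam.
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    # Exponentially weighted infinity norm: u_t = max(beta2 * u_{t-1}, |g_t|).
    u = [max(beta2 * ui, abs(g)) for ui, g in zip(u, grad)]
    # Parameter update; assumes u_t > 0, i.e. a nonzero gradient has been seen.
    step = alpha / (1.0 - beta1 ** t)
    theta = [th - step * mi / ui for th, mi, ui in zip(theta, m, u)]
    return theta, m, u


# Hypothetical single step with gradient -4.0 (demo values only).
theta, m, u = adamax_step([1.0], [0.0], [0.0], grad=[-4.0], t=1)
print(theta, m, u)
```

Only the first moment is bias-corrected in the update, since the max operation does not shrink $u_{t}$ towards zero the way an average started at zero does.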

Nadam

The Nadam algorithm^[5] was proposed in 2016 and incorporates the Nesterov Accelerated Gradient (NAG)^[6], a popular momentum-like SGD variation, into the first-order moment term.
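To give a feel for how a Nesterov-style look-ahead can enter the first-order moment term, here is a sketch of a simplified form of the Nadam update that is commonly quoted. It assumes a constant β1 and omits the momentum schedule of the original proposal, so it should be read as an approximation rather than the exact published algorithm. The name `nadam_step` and the defaults are illustrative assumptions.

```python
import math


def nadam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nadam-style update on a scalar weight (constant beta1 assumed)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Look-ahead: blend the corrected momentum with the current, corrected gradient.
    m_bar = beta1 * m_hat + (1.0 - beta1) * grad / (1.0 - beta1 ** t)
    w = w - alpha * m_bar / (math.sqrt(v_hat) + eps)
    return w, m, v


# Hypothetical single step with gradient 2.0 (demo values only).
w, m, v = nadam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(w)
```

Compared with the Adam step, the only change is that the update uses `m_bar` instead of the bias-corrected first moment alone.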

Conclusion

Adam is a variant of the gradient descent algorithm that has been widely adopted in the machine learning community. Adam can be seen as the combination of two other variants of gradient descent, SGD with momentum and RMSProp. Adam uses estimates of the first and second-order moments of the gradient to adapt the parameter update. These moment estimates are computed via moving averages, $m_{t}$ and $v_{t}$ , of the gradient and the squared gradient respectively. In a variety of neural network training applications, Adam has shown improved convergence and robustness over other gradient descent algorithms and is often recommended as the default optimizer for training.^[7]

References

[1] https://arxiv.org/pdf/1412.6980.pdf Adam: A Method for Stochastic Optimization
[2] Deep Learning (Adaptive Computation and Machine Learning series)
[3] https://www.geeksforgeeks.org/intuition-of-adam-optimizer/ Intuition of Adam Optimizer
[4] https://arxiv.org/pdf/1412.6980.pdf Adam: A Method for Stochastic Optimization (the original Adam paper, same as [1])
[5] Dozat, Timothy. Incorporating Nesterov Momentum into Adam. ICLR Workshop, no. 1, 2016, pp. 2013–16.
[6] Nesterov, Yuri. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, 1983, pp. 372-376.
[7] "Neural Networks Part 3: Learning and Evaluation," CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University, 2020


@@ Line 91: / Line 91: @@
-To illustrate how updates occur in the Adam algorithm, consider a linear, least-squares regression problem formulation. The table below shows a sample data-set of student exam grades and the number of hours spent studying for the exam. The goal of this example will be to generate a linear model to predict exam grades as a function of time spent studying.
+Let's see an example of Adam optimizer. A sample dataset is shown below which as weight and height of couple of people. We have to predict the height of a person based on the given weight.
 {| class="wikitable"
 |-
-| Hours Studying || 9.0 || 4.9 || 1.6 || 1.9 || 7.9 || 2.0 || 11.5 || 3.9 || 1.1 || 1.6 || 5.1 || 8.2 || 7.3 || 10.4 || 11.2
+| Hours Studying || 60 || 76 || 85 || 76 || 50 || 55 || 100 || 105 || 45 || 78 || 57 || 91 || 69 || 74 || 112
 |-
-| Exam Grad || 88.0 || 72.3 || 66.5 || 65.1 || 79.5 || 60.8 || 94.3, || 66.7 || 65.4 || 63.8 || 68.4 || 82.5 || 75.9 || 87.8 || 85.2
+| Exam Grad || 76 || 72.3 || 88 || 60 || 79 || 47 || 67 || 66 || 65 || 61 || 68 || 56 || 75 || 57 || 76
 |}
-The hypothesized model function will be
+The hypothesis function is,
 <math>f_\theta(x) = \theta_0 + \theta_1 x.</math>
-The cost function is defined as
+The cost function is,
 <math> J({\theta}) =  \frac{1}{2}\sum_i^n \big(f_\theta(x_i) - y_i \big)^2, </math>
-Where the <math>1/2</math> coefficient is used only to make the derivatives cleaner. The optimization problem can then be formulated as trying to find the values of <math>\theta</math> that minimize the squared residuals of <math>f_\theta(x)</math> and <math>y</math>.
+The optimization problem is defined as, we have to find the values of theta which help to minimize the objective function mentioned below,
 <math> \mathrm{argmin}_{\theta} \quad \frac{1}{n}\sum_{i}^n \big(f_\theta(x_i) - y_i \big) ^2 </math>
-For simplicity, parameters will be updated after every data point i.e. a batch size of 1. For a single data point the derivatives of the cost function with respect to <math>\theta_0</math> and <math>\theta_1</math> are
+The cost function with respect to the weights <math>\theta_0</math> and <math>\theta_1</math> are,
 <math> \frac{\partial J(\theta)}{\partial \theta_0} = \big(f_\theta(x) - y \big)   </math><br/>
 <math> \frac{\partial J(\theta)}{\partial \theta_1} = \big(f_\theta(x) - y \big) x </math>
-The initial values of <math>{\theta}</math> will be set to [50, 1] and  The learning rate, <math>\alpha</math>, is set to 0.1 and the suggested parameters for <math>\beta_1</math>, <math>\beta_2</math>, and <math>\epsilon</math> are used. With the first data sample of <math> (x,y)=[8.98, 88.01]</math>, the computed gradients are
+The initial values of <math>{\theta}</math> will be set to [10, 1] and the learning rate <math>\alpha</math>, is set to 0.01 and setting the parameters <math>\beta_1</math>, <math>\beta_2</math>, and e as 0.94, 0.9878 and 10^-8 respectively. Starting from the first data sample the gradients are,
-<math> \frac{\partial J(\theta)}{\partial \theta_0} = \big((50 + 1\cdot 9 - 88.01 \big) = -29.0  </math><br/>
+<math> \frac{\partial J(\theta)}{\partial \theta_0} = \big((10 + 1\cdot 60 - 76 \big) = -6  </math><br/>
-<math> \frac{\partial J(\theta)}{\partial \theta_1} = \big((50 + 1\cdot 9 - 88.01 \big)\cdot 9.0 = -261  </math><br/>
+<math> \frac{\partial J(\theta)}{\partial \theta_1} = \big((10 + 1\cdot 60 - 76 \big)\cdot 60 = -360  </math><br/>
-With <math>m_0</math> and <math>v_0</math> being initialized to zero, the calculations of <math>m_1</math> and <math>v_1</math> are
+Here <math>m_0</math> and <math>v_0</math> are zero, <math>m_1</math> and <math>v_1</math> are calculated as
-<math> m_1 = 0.9 \cdot 0 + (1-0.9) \cdot \begin{bmatrix} -29\\ -261 \end{bmatrix} = \begin{bmatrix} -2.9\\ -26.1\end{bmatrix} </math> <br/>
+<math> m_1 = 0.94 \cdot 0 + (1-0.94) \cdot \begin{bmatrix} -6\\ -360 \end{bmatrix} = \begin{bmatrix} -0.36\\ -21.6\end{bmatrix} </math> <br/>
-<math> v_1 = 0.999\cdot 0 + (1-0.999) \cdot \begin{bmatrix} -29^2\\-261^2 \end{bmatrix} = \begin{bmatrix} 0.84\\ 68.2\end{bmatrix} , </math> <br/>
+<math> v_1 = 0.9878\cdot 0 + (1-0.9878) \cdot \begin{bmatrix} -6^2\\-360^2 \end{bmatrix} = \begin{bmatrix} 0.4392\\ 1581.12\end{bmatrix} , </math> <br/>
-The bias-corrected terms are computed as
+The new bias-corrected values of <math>m_1</math> and <math>v_1</math> are,
-<math> \hat{m}_1 = \begin{bmatrix} -2.9\\ -26.1\end{bmatrix} \frac{1}{ (1-0.9^1)} =  \begin{bmatrix} -29.0\\-261.1\end{bmatrix}</math> <br/>
+<math> \hat{m}_1 = \begin{bmatrix} -0.36\\ -21.6\end{bmatrix} \frac{1}{ (1-0.94^1)} =  \begin{bmatrix} -6\\-360\end{bmatrix}</math> <br/>
-<math> \hat{v}_1 = \begin{bmatrix} 0.84\\ 68.2\end{bmatrix}  \frac{1} {(1-0.999^1)} = \begin{bmatrix} 851.5\\68168\end{bmatrix}. </math> <br/>
+<math> \hat{v}_1 = \begin{bmatrix} 0.4392\\ 1581.12\end{bmatrix}  \frac{1} {(1-0.9878^1)} = \begin{bmatrix} 36\\129600\end{bmatrix}. </math> <br/>
-Finally, the parameter update is
+Finally, the weight update is,
-<math> \theta_0 = 50 - 0.1 \cdot -29 / (\sqrt{851.5} + 10^{-8}) = 50.1 </math> <br/>
+<math> \theta_0 = 10 - 0.01 \cdot -6 / (\sqrt{36} + 10^{-8}) = 10.01 </math> <br/>
-<math> \theta_1 = 1 - 0.1 \cdot -261 / (\sqrt{68168} + 10^{-8}) = 1.1 </math> <br/>
+<math> \theta_1 = 1 - 0.01 \cdot -360 / (\sqrt{129600} + 10^{-8}) = 1.01 </math> <br/>
+The procedure is repeated until the values of the weights are converged.
-This procedure is repeated until the parameters have converged, giving <math>\theta</math> values of <math>[58.98, 2.72]</math>. The figures to the right show the trajectory of the Adam algorithm over a contour plot of the objective function and the resulting model fit. It should be noted that the stochastic gradient descent algorithm with a learning rate of 0.1 diverges and with a rate of 0.01, SGD oscillates around the global minimum due to the large magnitudes of the gradient in the <math>\theta_1</math> direction.
 == Applications ==
-The Adam optimization algorithm has been widely used in machine learning applications to train model parameters. When used with backpropagation, the Adam algorithm has been shown to be a very robust and efficient method for training artificial neural networks and is capable of working well with a variety of structures and applications. In their original paper, the authors present three different training examples, logistic regression, multi-layer neural networks for classification of MNIST images, and a convolutional neural network (CNN). The training results from the original Adam paper showing the objective function cost vs. the iteration over the entire data set for the multi-layer neural network is shown to the right.
+The Adam optimization algorithm is the replacement optimization algorithm for SGD for training DNN. According to  Adam combines the best properties of the AdaGrad and RMSP algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. Adam is proved to be the best optimizer amongst all the other optimizers such as AdaGrad, SGD, RMSP etc. Further research is going on Adaptive optimizers for Federated Learning and their performances are being compared. Federated Learning is a privacy preserving technique which is an alternative for Machine Learning where data training is done on the device itself without sharing it with the cloud server.
 == Variants of Adam ==