Author: Vinamr Jain (vj89) (ChemE 6800 Fall 2024)
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Introduction
Nesterov-accelerated Adaptive Moment Estimation (Nadam) is an optimization algorithm introduced by Timothy Dozat in his 2016 paper, “Incorporating Nesterov Momentum into Adam” [1]. Nadam enhances the Adaptive Moment Estimation (Adam) optimizer by integrating Nesterov momentum, aiming to improve convergence speed and performance in training deep learning models. This algorithm has been applied across various domains, including computer vision and natural language processing, to optimize neural network training.

Gradient descent optimization algorithms are fundamental tools in machine learning, often employed as black-box optimizers to minimize loss functions by iteratively adjusting model parameters. Gradient Descent updates parameters by following the negative gradient of the loss function to seek a minimum, but it can be slow to converge, especially in regions with flat or elongated curvature [2]. Several variants address these challenges. Momentum accelerates convergence by accumulating a decaying sum of past gradients, helping navigate ravines and reducing oscillations [1]. Nesterov’s Accelerated Gradient (NAG), also known as Nesterov momentum, improves upon Momentum by calculating the gradient at a projected future position, often resulting in faster convergence. RMSprop adapts learning rates for each parameter individually, normalizing gradients by a decaying average of recent magnitudes and thereby handling non-stationary objectives effectively [3]. Adaptive Moment Estimation (Adam) [4] combines the benefits of Momentum and adaptive learning rates, computing individual adaptive rates from the first and second moments of the gradients. However, Adam may sometimes converge to suboptimal solutions due to its adaptive learning rate mechanism [3] and can slow down before reaching the global optimum [2]. Nadam extends Adam by incorporating Nesterov momentum, evaluating the gradient at a projected future parameter position to improve convergence speed and accuracy.
Background
Gradient Descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. The parameter update rule at iteration $t$ is:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

where $\theta_t$ denotes the parameter vector at iteration $t$, $\eta$ is the learning rate, and $\nabla_\theta J(\theta_t)$ is the gradient of the cost function $J$ with respect to $\theta$ at iteration $t$.
Momentum accelerates Gradient Descent by accumulating an exponentially decaying moving average of past gradients. This accelerates gradient descent along stable gradient directions and slows it in oscillating directions [1]:

$$m_t = \gamma \, m_{t-1} + \eta \, \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t - m_t$$

Here, in addition to the gradient component, there is an additional term $m_t$, the momentum vector, and $\gamma$ is the momentum coefficient. Nesterov Momentum improves upon standard Momentum by calculating the gradient at the anticipated future position, leading to a more responsive update:

$$m_t = \gamma \, m_{t-1} + \eta \, \nabla_\theta J(\theta_t - \gamma \, m_{t-1}), \qquad \theta_{t+1} = \theta_t - m_t$$

Alternatively, one can apply the current look-ahead momentum vector directly to update the current parameters:

$$\theta_{t+1} = \theta_t - \left( \gamma \, m_t + \eta \, \nabla_\theta J(\theta_t) \right), \qquad \text{with } m_t = \gamma \, m_{t-1} + \eta \, \nabla_\theta J(\theta_t)$$
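For concreteness, the following Python sketch implements one step of plain Gradient Descent, Momentum, and the look-ahead form of NAG for a generic gradient function. The function names, the toy quadratic objective, and the hyperparameter values are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def gd_step(theta, grad, lr=0.1):
    """Plain Gradient Descent: theta <- theta - eta * grad(theta)."""
    return theta - lr * grad(theta)

def momentum_step(theta, m, grad, lr=0.1, gamma=0.9):
    """Momentum: accumulate a decaying sum of past gradients, then step."""
    m = gamma * m + lr * grad(theta)
    return theta - m, m

def nag_step(theta, m, grad, lr=0.1, gamma=0.9):
    """NAG (look-ahead form): apply the *current* momentum vector directly."""
    g = grad(theta)
    m = gamma * m + lr * g
    return theta - (gamma * m + lr * g), m

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
grad = lambda theta: theta
theta, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5):
    theta, m = nag_step(theta, m, grad)
print(theta)  # the iterates move toward the minimum at the origin
```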
AdaGrad (Adaptive Gradient Algorithm) adapts the learning rate for each parameter individually by scaling it inversely proportional to the square root of the sum of all historical squared values of the gradients. This method is particularly useful for dealing with sparse data [3]:

$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \, g_t$$

where $g_t = \nabla_\theta J(\theta_t)$, $G_t$ is the accumulated sum of squared gradients up to time $t$, and $\epsilon$ is a small constant for numerical stability (e.g., $10^{-8}$). AdaGrad effectively scales down the learning rate for parameters associated with frequently occurring features, allowing for larger updates for infrequent features [1].
RMSprop (Root Mean Square Propagation) is an optimization algorithm that builds upon AdaGrad by modifying the way it accumulates squared gradients. While AdaGrad accumulates all past squared gradients, leading to a continually decreasing learning rate, RMSprop introduces a decay parameter to the accumulated squared gradients $E[g^2]_t$, which prevents the learning rate from diminishing too quickly:

$$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t$$

where $\rho$ is the decay rate (typically $0.9$). By introducing the decay parameter in the $E[g^2]_t$ term, RMSprop limits the influence of older gradients, ensuring that the algorithm adapts quickly to the most recent gradients.
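The practical difference between AdaGrad and RMSprop is confined to the accumulator line, as the following sketch shows; the hyperparameter values (`lr=0.1`, `rho=0.9`, `eps=1e-8`) are assumed defaults chosen for illustration.

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.1, eps=1e-8):
    """AdaGrad: accumulate *all* past squared gradients, shrinking the step over time."""
    G = G + g**2
    return theta - lr / (np.sqrt(G) + eps) * g, G

def rmsprop_step(theta, Eg2, g, lr=0.1, rho=0.9, eps=1e-8):
    """RMSprop: a decaying average of squared gradients keeps the step size from vanishing."""
    Eg2 = rho * Eg2 + (1 - rho) * g**2
    return theta - lr / (np.sqrt(Eg2) + eps) * g, Eg2
```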
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines ideas from both Momentum and RMSprop. It extends RMSprop by not only adapting the learning rates per parameter but also incorporating momentum by using an exponentially decaying average of past gradients (first moment). Adam is designed to work well with sparse gradients and noisy problems, which are common in training deep neural networks:

$$m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t, \qquad v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2$$

Here $m_t$ is the first moment estimate (exponentially decaying average of past gradients), $v_t$ is the second moment estimate (exponentially decaying average of past squared gradients), and $\beta_1$ and $\beta_2$ are the exponential decay rates for the moment estimates (commonly $\beta_1 = 0.9$ and $\beta_2 = 0.999$).

As $m_t$ and $v_t$ are initialized as zero vectors, they are biased towards zero, especially during the initial time steps and particularly when the decay rates are close to one (i.e., $\beta_1, \beta_2 \approx 1$). This bias can slow down the convergence of the algorithm because the moment estimates underestimate the true mean and variance of the gradients. To counteract these biases, the authors of Adam introduced bias-corrected moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The denominators $1 - \beta_1^t$ and $1 - \beta_2^t$ approach $1$ as $t$ increases, reducing the impact of the bias over time. The bias correction compensates for the initialization at zero, ensuring that the moment estimates are unbiased. The parameters are then updated using the bias-corrected moment estimates:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
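A single Adam update follows directly from these equations. The sketch below uses the commonly cited default hyperparameters; `adam_step` is an illustrative helper name rather than a library API.

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based iteration counter)."""
    m = beta1 * m + (1 - beta1) * g        # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * g**2     # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```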
Algorithm
Nadam modifies the parameter update step of Adam by integrating the Nesterov accelerated gradient. In Adam, the update relies solely on the bias-corrected first moment estimate $\hat{m}_t$. Nadam, however, adjusts this estimate by adding a term that accounts for the current gradient, effectively looking ahead to where the parameters are heading. This adjustment is similar to the Nesterov momentum in NAG, which computes the gradient at the approximate future position of the parameters. First, expand the bias-corrected first moment $\hat{m}_t$ using the definition of $m_t$ [3]:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} = \frac{\beta_1 \, m_{t-1}}{1 - \beta_1^t} + \frac{(1 - \beta_1) \, g_t}{1 - \beta_1^t}$$
where $g_t = \nabla_\theta J(\theta_t)$. The update rule of Adam can then be rewritten as:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \frac{\beta_1 \, m_{t-1}}{1 - \beta_1^t} + \frac{(1 - \beta_1) \, g_t}{1 - \beta_1^t} \right)$$
Recognizing that $\tfrac{m_{t-1}}{1 - \beta_1^{t-1}}$ is the bias-corrected estimate of the momentum vector from the previous time step, denote:

$$\hat{m}_{t-1} = \frac{m_{t-1}}{1 - \beta_1^{t-1}}$$
However, since $1 - \beta_1^{t-1}$ is close to $1 - \beta_1^{t}$ for large $t$, the first term can be approximated as:

$$\frac{\beta_1 \, m_{t-1}}{1 - \beta_1^{t}} \approx \beta_1 \, \hat{m}_{t-1}$$
Thus, the update rule becomes:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \, \hat{m}_{t-1} + \frac{(1 - \beta_1) \, g_t}{1 - \beta_1^t} \right)$$
This form resembles the momentum term in Nesterov Accelerated Gradient (NAG). To incorporate Nesterov momentum into Adam, the previous bias-corrected momentum estimate $\hat{m}_{t-1}$ is replaced with the current estimate $\hat{m}_t$. This aligns with the NAG approach of using the look-ahead momentum vector for updating parameters.
The Nadam update rule becomes:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \, \hat{m}_{t} + \frac{(1 - \beta_1) \, g_t}{1 - \beta_1^t} \right)$$
By combining the bias-corrected first moment estimate $\hat{m}_t$ and the current gradient $g_t$, weighted appropriately, Nadam aims to improve the responsiveness of the optimizer to recent gradient information, potentially enhancing convergence speed.
The pseudocode for Nadam is as follows:
Algorithm: Nadam Optimization
Input: initial parameters $\theta_0$, learning rate $\eta$, decay rates $\beta_1$, $\beta_2$, small constant $\epsilon$
Initialize $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $t \leftarrow 0$
While not converged do
    $t \leftarrow t + 1$
    Compute gradient: $g_t = \nabla_\theta J(\theta_{t-1})$
    Update biased first moment: $m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t$
    Update biased second moment: $v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2$
    Bias correction: $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$
    Nesterov adjustment: $\bar{m}_t = \beta_1 \, \hat{m}_t + \dfrac{(1 - \beta_1) \, g_t}{1 - \beta_1^t}$
    Update parameters: $\theta_t = \theta_{t-1} - \dfrac{\eta \, \bar{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
End while
Output: optimized parameters $\theta_t$
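The pseudocode translates almost line for line into NumPy. The following sketch is a minimal reference implementation for a generic gradient function `grad`; the default hyperparameter values are assumed common choices rather than values mandated by the algorithm.

```python
import numpy as np

def nadam(grad, theta0, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    """Minimize a cost J by following its gradient `grad` with the Nadam update rule."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                      # biased first moment
    v = np.zeros_like(theta)                      # biased second moment
    for t in range(1, n_steps + 1):
        g = grad(theta)                           # g_t = grad J(theta_{t-1})
        m = beta1 * m + (1 - beta1) * g           # biased first moment
        v = beta2 * v + (1 - beta2) * g**2        # biased second moment
        m_hat = m / (1 - beta1**t)                # bias-corrected first moment
        v_hat = v / (1 - beta2**t)                # bias-corrected second moment
        m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)   # Nesterov adjustment
        theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)        # parameter update
    return theta

# Example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
print(nadam(lambda th: 2 * th, theta0=[3.0, -1.5], lr=0.1, n_steps=200))
```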
Numerical Examples
This section provides numerical examples to illustrate how the Nadam algorithm operates. We choose a simple quadratic function to make the calculations straightforward and easy to follow, and then we demonstrate the numerical calculations for a linear regression task.
Minimizing a Quadratic Function
Consider a simple one-dimensional quadratic cost function $J(\theta)$ with a single global minimum at $\theta^{*}$. The aim is to minimize this function using the Nadam optimizer, starting from an initial parameter value $\theta_0$.

The hyperparameters are the learning rate $\eta$, the exponential decay rates $\beta_1$ and $\beta_2$, and the numerical stability constant $\epsilon$; the first and second moment estimates and the time step are initialized as $m_0 = 0$, $v_0 = 0$, and $t = 0$.
Iteration 1
1. Time Step Update: $t \leftarrow t + 1 = 1$
2. Compute Gradient at Current Parameter: $g_1 = \nabla_\theta J(\theta_0)$
3. Update Biased First Moment Estimate: $m_1 = \beta_1 \, m_0 + (1 - \beta_1) \, g_1 = (1 - \beta_1) \, g_1$
4. Update Biased Second Moment Estimate: $v_1 = \beta_2 \, v_0 + (1 - \beta_2) \, g_1^2 = (1 - \beta_2) \, g_1^2$
5. Compute Bias-Corrected Estimates: $\hat{m}_1 = \dfrac{m_1}{1 - \beta_1}$, $\hat{v}_1 = \dfrac{v_1}{1 - \beta_2}$
6. Nesterov Adjustment for First Moment: $\bar{m}_1 = \beta_1 \, \hat{m}_1 + \dfrac{(1 - \beta_1) \, g_1}{1 - \beta_1}$
7. Parameter Update: $\theta_1 = \theta_0 - \dfrac{\eta \, \bar{m}_1}{\sqrt{\hat{v}_1} + \epsilon}$
Iteration 2
1. Time Step Update: $t \leftarrow t + 1 = 2$
2. Compute Gradient at Current Parameter: $g_2 = \nabla_\theta J(\theta_1)$
3. Update Biased First Moment Estimate: $m_2 = \beta_1 \, m_1 + (1 - \beta_1) \, g_2$
4. Update Biased Second Moment Estimate: $v_2 = \beta_2 \, v_1 + (1 - \beta_2) \, g_2^2$
5. Compute Bias-Corrected Estimates: $\hat{m}_2 = \dfrac{m_2}{1 - \beta_1^2}$, $\hat{v}_2 = \dfrac{v_2}{1 - \beta_2^2}$
6. Nesterov Adjustment for First Moment: $\bar{m}_2 = \beta_1 \, \hat{m}_2 + \dfrac{(1 - \beta_1) \, g_2}{1 - \beta_1^2}$
7. Parameter Update: $\theta_2 = \theta_1 - \dfrac{\eta \, \bar{m}_2}{\sqrt{\hat{v}_2} + \epsilon}$
Iteration 3
1. Time Step Update: $t \leftarrow t + 1 = 3$
2. Compute Gradient at Current Parameter: $g_3 = \nabla_\theta J(\theta_2)$
3. Update Biased First Moment Estimate: $m_3 = \beta_1 \, m_2 + (1 - \beta_1) \, g_3$
4. Update Biased Second Moment Estimate: $v_3 = \beta_2 \, v_2 + (1 - \beta_2) \, g_3^2$
5. Compute Bias-Corrected Estimates: $\hat{m}_3 = \dfrac{m_3}{1 - \beta_1^3}$, $\hat{v}_3 = \dfrac{v_3}{1 - \beta_2^3}$
6. Nesterov Adjustment for First Moment: $\bar{m}_3 = \beta_1 \, \hat{m}_3 + \dfrac{(1 - \beta_1) \, g_3}{1 - \beta_1^3}$
7. Parameter Update: $\theta_3 = \theta_2 - \dfrac{\eta \, \bar{m}_3}{\sqrt{\hat{v}_3} + \epsilon}$
After three iterations, the parameter has been updated from $\theta_0$ through $\theta_1$ and $\theta_2$ to $\theta_3$. The parameter $\theta$ moves gradually toward the global minimum at $\theta^{*}$, demonstrating how the Nadam optimizer adjusts the parameters efficiently.
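To attach concrete numbers to these steps, the short script below runs three Nadam iterations on an assumed instance of the example: $J(\theta) = \theta^2$ with global minimum at $\theta^{*} = 0$, starting point $\theta_0 = 1$, learning rate $\eta = 0.1$, decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. These particular values are illustrative assumptions.

```python
import numpy as np

# Assumed example: J(theta) = theta^2, so dJ/dtheta = 2 * theta.
grad = lambda theta: 2 * theta

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta, m, v = 1.0, 0.0, 0.0

for t in range(1, 4):                              # three iterations, t = 1, 2, 3
    g = grad(theta)                                # step 2: gradient at current parameter
    m = beta1 * m + (1 - beta1) * g                # step 3: biased first moment
    v = beta2 * v + (1 - beta2) * g**2             # step 4: biased second moment
    m_hat = m / (1 - beta1**t)                     # step 5: bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)   # step 6: Nesterov adjustment
    theta = theta - eta * m_bar / (np.sqrt(v_hat) + eps)       # step 7: parameter update
    print(f"t = {t}: theta = {theta:.4f}")
```

Under these assumed values, $\theta$ decreases monotonically toward the minimum at $0$ over the three iterations.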