Adadelta

Author: Imran Shita-Bey (ias45), Dhruv Misra (dm668), Ifadhila Affia (ia284) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

Optimization algorithms (“optimizers”) are analogous, in a way, to engines in cars. Just as different engines are designed to suit various driving requirements—some prioritizing speed, others efficiency—different optimizers are tailored to solve distinct types of problems. In machine learning, optimization algorithms like Adadelta adjust learning rates during neural network training in a somewhat-uniquely dynamic manner, enabling efficient and stable convergence. Introduced by Matthew D. Zeiler, Adadelta was developed to address the limitations of earlier adaptive methods, particularly Adagrad's vanishing learning rates in prolonged training sessions ^[1]. This innovation made Adadelta especially valuable as a method that does not need to manually tune learning rates, and therefore “appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters” ^[1].

Similar to how even similarly sized cars can come with a myriad of engine types, other optimization algorithms, such as RMSprop ^[2], Adam ^[3], and Nadam ^[4], exist as well. Each algorithm has unique characteristics and trade-offs, much like how different engines excel in specific environments. For instance, Adam combines the strengths of momentum and adaptive learning rates to provide a versatile option for training models ^[3]. The existence of multiple optimizers emphasizes their role as the "engines" powering machine learning, allowing models to tackle a wide array of challenges.

Historically, gradient-based optimization techniques evolved to tackle challenges in training models with large parameter spaces, such as vanishing gradients and instability in updates ^[5]. Adadelta marked a critical milestone by automating learning rate adjustments, eliminating the need for manual hyperparameter tuning, and maintaining computational efficiency ^[1]. Its principles influenced the development of other algorithms like RMSprop and Adam, further showcasing how optimizers evolve to address specific needs, namely, adaptability.

The study of Adadelta and related optimization algorithms is driven by the goal of improving machine learning model training—minimizing loss functions effectively, reducing training time, and achieving better generalization. By understanding Adadelta’s mechanism and its relationship to other optimizers, researchers and practitioners can make informed decisions when selecting tools for specific machine learning tasks, optimizing performance across diverse applications ^[5].

Algorithm Discussion

Adadelta determines adaptive learning rates by maintaining running averages of squared gradients and parameter updates over a window. Using only first-order gradient information, these averages allow the algorithm to automatically scale the learning rate for each parameter based on their historical behavior.

The core update rule for Adadelta at time t is:

$\Delta x_{t}=-{\frac {{\text{RMS}}[\Delta x]_{t-1}}{{\text{RMS}}[g]_{t}}}g_{t}$

where:

$g_{t}$ is the gradient at time $t$ ,
${\text{RMS}}[g]_{t}$ is the root mean square of accumulated gradients,
${\text{RMS}}[\Delta x]_{t-1}$ is the root mean square of accumulated parameter updates,
$\Delta x_{t}$ is the parameter update at time $t$ .

The RMS terms are computed using exponentially decaying averages:

$\mathbb {E} [g^{2}]_{t}=\rho \mathbb {E} [g^{2}]_{t-1}+(1-\rho )g_{t}^{2}$

${\text{RMS}}[g]_{t}={\sqrt {\mathbb {E} [g^{2}]_{t}+\epsilon }}$

where:

$\rho$ is the decay constant,
$\epsilon$ is a small constant for numerical stability.

The complete algorithm proceeds as follows:

Initialize accumulation variables $\mathbb {E} [g^{2}]_{0}=0$ and $\mathbb {E} [\Delta x^{2}]_{0}=0$
For each time step $t$ $t$ :
1. Compute gradient $g_{t}$
2. Accumulate squared gradient: $\mathbb {E} [g^{2}]_{t}=\rho \mathbb {E} [g^{2}]_{t-1}+(1-\rho )g_{t}^{2}$
3. Compute update: $\Delta x_{t}=-{\frac {\sqrt {\mathbb {E} [\Delta x^{2}]_{t-1}+\epsilon }}{\sqrt {\mathbb {E} [g^{2}]_{t}+\epsilon }}}g_{t}$
4. Accumulate squared updates: $\mathbb {E} [\Delta x^{2}]_{t}=\rho \mathbb {E} [\Delta x^{2}]_{t-1}+(1-\rho )(\Delta x_{t})^{2}$
5. Apply update: $x_{t+1}=x_{t}+\Delta x_{t}$

This algorithm assumes:

The objective function is differentiable with respect to all parameters,
Parameters can be updated independently,
The exponential decay provides a reasonable approximation of recent gradient history,
The RMS values of $\Delta x$ and $g$ can serve as an approximation to the diagonal Hessian.

Numerical Examples

Example 1

Consider minimizing the quadratic loss function $f(x)=x^{2}$ , starting at $x=5$ using Adadelta with $\rho =0.95$ and $\epsilon =10^{-6}$ .

Note that:

The gradient is ${\frac {df}{dx}}=2x$ ,
The goal is to reach the minimum at $x=0$ .

The general update equations for both gradient and update accumulation are:

Exponential Moving Average: $\mathbb {E} [y^{2}]_{t}=\rho \mathbb {E} [y^{2}]_{t-1}+(1-\rho )y_{t}^{2}$ .
Root Mean Square: ${\text{RMS}}[y]_{t}={\sqrt {\mathbb {E} [y^{2}]_{t}+\epsilon }}$ .

Where $y$ can be either $g$ (gradient) or $\Delta x$ (parameter update).

First Iteration:

Initial conditions: $x_{1}=5$ , $\mathbb {E} [g^{2}]_{0}=0$ , $\mathbb {E} [\Delta x^{2}]_{0}=0$
Gradient: $g_{1}=2(5)=10.00$

Computing step:

Gradient accumulation:
1. $\mathbb {E} [g^{2}]_{1}=0.95(0)+(1-0.95)(100.0)=5.000$
2. ${\text{RMS}}[g]_{1}={\sqrt {5.000+10^{-6}}}=2.236$
Parameter update:
1. ${\text{RMS}}[\Delta x]_{0}={\sqrt {0+10^{-6}}}=1.000\times 10^{-3}$
2. $\Delta x_{1}=-{\frac {{\text{RMS}}[\Delta x]_{0}}{{\text{RMS}}[g]_{1}}}g_{1}=-{\frac {1.000\times 10^{-3}}{2.236}}(10.00)=-4.472\times 10^{-3}$
Update accumulation:
1. $\mathbb {E} [\Delta x^{2}]_{1}=0.95(0)+(1-0.95)((4.472\times 10^{-3})^{2})=1.000\times 10^{-6}$
2. $x_{2}=5+(-4.472\times 10^{-3})=4.996$

Second Iteration:

Current position: $x_{2}=4.996$
Gradient: $g_{2}=2(4.996)=9.992$

Computing step:

Gradient accumulation:
1. $\mathbb {E} [g^{2}]_{2}=0.95(5.000)+(1-0.95)(99.84)=9.742$
2. ${\text{RMS}}[g]_{2}={\sqrt {9.742+10^{-6}}}=3.121$
Parameter update:
1. ${\text{RMS}}[\Delta x]_{1}={\sqrt {1.000\times 10^{-6}+10^{-6}}}=1.414\times 10^{-3}$
2. $\Delta x_{2}=-{\frac {{\text{RMS}}[\Delta x]_{1}}{{\text{RMS}}[g]_{2}}}g_{2}=-{\frac {1.414\times 10^{-3}}{3.121}}(9.992)=-4.524\times 10^{-3}$
Update accumulation:
1. $\mathbb {E} [\Delta x^{2}]_{2}=0.95(1.000\times 10^{-6})+(1-0.95)((4.524\times 10^{-3})^{2})=1.023\times 10^{-6}$
2. $x_{3}=4.996+(-4.524\times 10^{-3})=4.991$

The continuation of this process, displayed in Figure 1, shows that it takes about 1500 iterations to converge.

Figure 1. Iterating $x$ and $\Delta x$ during the minimization of $f(x)=x^{2}$ using the Adadelta algorithm.

Example 2

Next, Adadelta is applied to optimize a more complex function, $u(x)=x^{2}+10\sin(2x)$ , such that it equals 10. Mean squared error will be used as the loss function, resulting in the following:

${\text{Loss}}=f(x)=(10-(x^{2}+10\sin(2x)))^{2}$

${\frac {df}{dx}}=-2(10-x^{2}-10\sin(2x))(2x+20\cos(2x))$

Starting at $x_{0}=0.85$ (shown in Figure 2), apply the Adadelta algorithm described in the $f(x)=x^{2}$ case to minimize the new loss function. Figure 3 tracks the algorithm's progress showing both $\Delta x$ (parameter updates) and the loss function value over 50 iterations where it converges at approximately iteration 30 where $x=1.01355$ and $u(x)=10.00415$ .

Figure 3. Iterating Loss and $\Delta x$ during the minimization of ${\text{Loss}}=f(x)=(10-(x^{2}+10\sin(2x)))^{2}$ using the Adadelta algorithm.

Application

The Adadelta optimization algorithm is commonly used in deep learning systems with sparse gradients ^[1]. Adadelta particularly excels in training complex neural architectures such as deep convoluted neural networks and sequence models, where gradient magnitudes may vary significantly across different layers. Its adaptive learning rate algorithm makes it best applied in architectures with varying parameter scales. Adadelta has been applied as a core optimizer in deep learning frameworks such as TensorFlow ^[6], PyTorch ^[7], and Keras^[8], each providing specific features and usage.

TensorFlow: In TensorFlow, Adadelta is available in the tf.keras.optimizers module. This framework is designed with a clearly defined API for setting up Adadelta with popular defaults for learning rate, rho, and epsilon. Users are allowed to change these for better results that best fit model requirements, hence flexible and easy to use.
PyTorch: In PyTorch, Adadelta is available in the torch.optim module. This implementation is highly configurable and naturally interoperates well with the dynamic computational graph in PyTorch, which makes it very suitable for training models where the layer structure is complex and/or where the gradient behavior is irregular.
Keras: In Keras, Adadelta is available in the keras.optimizers module. The model can be executed directly during the process of model compilation, providing a straightforward yet efficient interface. The architecture emphasizes user-friendliness, enabling developers to focus on the construction and enhancement of their models while taking advantage of Adadelta's adaptive learning features.

While another similar optimizer, Adam, has recently gained more popularity in application because of its adaptive properties ^[9], Adadelta is still relevantly applicable for specific use cases in favor of per-dimension learning rate adaptation.

For its adaptive nature, Adadelta has the potential to be applied in both engineering and finance ^[10]. In engineering, deep learning models, such as recurrent neural networks (RNNs), are widely used to analyze time-series sensor data in predicting when industrial machines need maintenance ^[11]. Although the benefits of Adadelta have not been clearly explained in this area, its ability to adaptively update gradients makes it a good option for improving these models. In finance, deep learning models have been widely studied for tasks including stock price prediction and credit risk assessment using the optimizer Adam and RMSprop ^[12]. Not many studies mention Adadelta, but its adaptation capability makes it a good choice for similar tasks in financial modeling. The incorporation of Adadelta in widely used frameworks such as TensorFlow, PyTorch, and Keras convinces its availability for various applications, enabling practitioners to integrate it into predictive analytics and financial modeling activities effectively.

Natural Language Processing (NLP)

The Adadelta optimization algorithm has been widely used in transformer architectures and RNNs concerning the different gradient scales for its effective gradient update capability ^[13]. This makes Adadelta a good choice for key NLP applications, including entity recognition, machine translation, text generation, and sequence-to-sequence learning tasks. Its ability to adjust learning rates adaptively is also particularly advantageous for training word embeddings and further network layers where gradient magnitudes may differ by several orders of magnitude. This adaptation allows smooth convergence even in challenging conditions like sparse and high-dimensional datasets.

Deep Neural Networks

The Adadelta optimization algorithm also exhibits stability when applied in network training with multiple layers. Its adaptive capability alleviates the vanishing and exploding gradient problems commonly found in deep architectures ^[14]. This has made Adadelta particularly useful in computer vision applications, especially when training deep convolutional neural networks for object detection, object classification, and semantic segmentation tasks. Its dynamic learning rate adjustment helps reduce manual tuning while accelerating model training on complex datasets.

Conclusion

Ultimately, Adadelta is an optimizer that stands on the shoulders of other algorithms before it, making for an advanced optimization algorithm, especially in the context of machine learning and deep neural networks. Adadelta’s key selling point is that its learning rates do not require manual tuning, and it instead adapts learning rates dynamically. This unique feature addresses the limitations of older methods such as Adagrad, circumventing the earlier-mentioned issue of vanishing learning rates that lead to hindered, prolonged training sessions. The algorithm's utilization of running averages of squared gradients and parameter updates allows for efficient and stable convergence, as demonstrated in the numerical example of minimizing a simple quadratic function.

In regard to Adadelta’s applications, there are a great number of use cases that benefit from an adaptive learning rate algorithm. For example, in complex neural architectures, where gradient magnitudes vary significantly across layers. Its relatively ubiquitous use in major deep learning frameworks such as TensorFlow, PyTorch, and Keras emphasized its robustness and utility. Adadelta also has seen innovative applications in further niche fields like NLP and computer vision. This can be credited to Adadelta’s facilitation of high-dimensional model training as well as training on sparse data. While newer algorithms like Adam have gained popularity due to their adaptive properties, Adadelta remains a valuable tool when there needs to be scrutiny on per-dimension learning rate adaptation. Understanding its mechanisms and advantages is essential for optimization/machine learning users to make informed decisions when selecting optimizers for specific tasks. Future developments may involve combining the strengths of Adadelta with other optimization methods to further enhance training efficiency and model performance across diverse applications.

References

^[1] ^[2] ^[3] ^[4] ^[5] ^[6] ^[7] ^[8] ^[9] ^[10] ^[11] ^[12] ^[13] ^[14]

↑ ^{Jump up to: 1.0} ^1.1 ^1.2 ^1.3 ^1.4 Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint. Retrieved from [1](https://arxiv.org/abs/1212.5701).
↑ ^{Jump up to: 2.0} ^2.1 Tieleman, T., & Hinton, G. (2012). Lecture 6.5 - RMSprop: Divide the Gradient by a Running Average of its Recent Magnitude. COURSERA: Neural Networks for Machine Learning.
↑ ^{Jump up to: 3.0} ^3.1 ^3.2 Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. Retrieved from [2](https://arxiv.org/abs/1412.6980).
↑ ^{Jump up to: 4.0} ^4.1 Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. In Proceedings of the International Conference on Learning Representations (ICLR) 2016, Workshop Track. Retrieved from [3](https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ).
↑ ^{Jump up to: 5.0} ^5.1 ^5.2 Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
↑ ^{Jump up to: 6.0} ^6.1 TensorFlow Documentation. (n.d.). tf.keras.optimizers.Adadelta. Retrieved from [4](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta).
↑ ^{Jump up to: 7.0} ^7.1 PyTorch Documentation. (n.d.). torch.optim.Adadelta. Retrieved from [5](https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html).
↑ ^{Jump up to: 8.0} ^8.1 Keras Documentation. (n.d.). Adadelta optimizer. Retrieved from [6](https://keras.io/api/optimizers/adadelta/).
↑ ^{Jump up to: 9.0} ^9.1 Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. Advances in Neural Information Processing Systems.
↑ ^{Jump up to: 10.0} ^10.1 ScienceDirect. Artificial Intelligence-Based Predictive Maintenance for Smart Manufacturing Systems. Alexandria Engineering Journal, 2024. Retrieved from [7](https://www.sciencedirect.com/science/article/pii/S1319157824001575).
↑ ^{Jump up to: 11.0} ^11.1 Liu, Z., & Hui, J. (2024). Predictive Maintenance for Industrial Systems: An Integrated Deep Learning and Event Log Approach. Emerald Studies in Reliability Engineering. Retrieved from [8](https://www.emerald.com/insight/content/doi/10.1108/sr-03-2024-0183/full/html).
↑ ^{Jump up to: 12.0} ^12.1 Chong, E., Han, C., & Park, F. C. (2017). Deep Learning Networks for Stock Market Analysis and Prediction: Methodology, Data Representations, and Case Studies. Expert Systems with Applications, vol. 83, 187–205. Retrieved from [9](https://www.sciencedirect.com/science/article/pii/S0378426617302435).
↑ ^{Jump up to: 13.0} ^13.1 Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
↑ ^{Jump up to: 14.0} ^14.1 Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. International Conference on Machine Learning.

[Zeiler2012-1] {Jump up to: 1.0} ^1.1 ^1.2 ^1.3 ^1.4 Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint. Retrieved from [1](https://arxiv.org/abs/1212.5701).

[Tieleman2012-2] {Jump up to: 2.0} ^2.1 Tieleman, T., & Hinton, G. (2012). Lecture 6.5 - RMSprop: Divide the Gradient by a Running Average of its Recent Magnitude. COURSERA: Neural Networks for Machine Learning.

[Kingma2014-3] {Jump up to: 3.0} ^3.1 ^3.2 Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. Retrieved from [2](https://arxiv.org/abs/1412.6980).

[Dozat2016-4] {Jump up to: 4.0} ^4.1 Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. In Proceedings of the International Conference on Learning Representations (ICLR) 2016, Workshop Track. Retrieved from [3](https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ).

[Goodfellow2016-5] {Jump up to: 5.0} ^5.1 ^5.2 Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

[TensorFlow-6] {Jump up to: 6.0} ^6.1 TensorFlow Documentation. (n.d.). tf.keras.optimizers.Adadelta. Retrieved from [4](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta).

[PyTorch-7] {Jump up to: 7.0} ^7.1 PyTorch Documentation. (n.d.). torch.optim.Adadelta. Retrieved from [5](https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html).

[Keras-8] {Jump up to: 8.0} ^8.1 Keras Documentation. (n.d.). Adadelta optimizer. Retrieved from [6](https://keras.io/api/optimizers/adadelta/).

[Wilson2017-9] {Jump up to: 9.0} ^9.1 Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. Advances in Neural Information Processing Systems.

[ScienceDirect2024-10] {Jump up to: 10.0} ^10.1 ScienceDirect. Artificial Intelligence-Based Predictive Maintenance for Smart Manufacturing Systems. Alexandria Engineering Journal, 2024. Retrieved from [7](https://www.sciencedirect.com/science/article/pii/S1319157824001575).

[Liu2024-11] {Jump up to: 11.0} ^11.1 Liu, Z., & Hui, J. (2024). Predictive Maintenance for Industrial Systems: An Integrated Deep Learning and Event Log Approach. Emerald Studies in Reliability Engineering. Retrieved from [8](https://www.emerald.com/insight/content/doi/10.1108/sr-03-2024-0183/full/html).

[Chong2017-12] {Jump up to: 12.0} ^12.1 Chong, E., Han, C., & Park, F. C. (2017). Deep Learning Networks for Stock Market Analysis and Prediction: Methodology, Data Representations, and Case Studies. Expert Systems with Applications, vol. 83, 187–205. Retrieved from [9](https://www.sciencedirect.com/science/article/pii/S0378426617302435).

[Vaswani2017-13] {Jump up to: 13.0} ^13.1 Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.

[Pascanu2013-14] {Jump up to: 14.0} ^14.1 Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. International Conference on Machine Learning.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]