Adamax - Revision history

Fall2024 Team13 at 19:37, 15 December 2024

2024-12-15T19:37:14Z

← Older revision		Revision as of 15:37, 15 December 2024
Line 103:		Line 103:

	==== Computer Vision ====		==== Computer Vision ====
	Adamax has been applied in image classification and object detection tasks using deep convolutional neural networks (CNNs). For instance, its stability and adaptive learning rate have been shown to improve the training of models like ResNet and EfficientNet.<ref>He, K., Zhang, X., Ren, S., & Sun, J. (2016). [https://arxiv.org/abs/1512.03385 Deep Residual Learning for Image Recognition]. arXiv preprint arXiv:1512.03385.</ref>		Adamax has been applied in image classification and object detection tasks using deep convolutional neural networks (CNNs). For instance, its stability and adaptive learning rate have been shown to improve the training of models like [[wikipedia:Residual_neural_network|ResNet]] and EfficientNet.<ref>He, K., Zhang, X., Ren, S., & Sun, J. (2016). [https://arxiv.org/abs/1512.03385 Deep Residual Learning for Image Recognition]. arXiv preprint arXiv:1512.03385.</ref>

	==== Reinforcement Learning ====		==== Reinforcement Learning ====
Line 112:		Line 112:

	==== Time Series Prediction ====		==== Time Series Prediction ====
	In time series forecasting tasks, Adamax efficiently handles models with ~~recurrent~~ [[wikipedia:Recurrent_neural_network|neural networks]] (RNNs) and transformers. It has been applied to tasks like financial prediction and sensor data analysis.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>		In time series forecasting tasks, Adamax efficiently handles models with [[wikipedia:Recurrent_neural_network|recurrent neural networks]] (RNNs) and transformers. It has been applied to tasks like financial prediction and sensor data analysis.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>

	Adamax is preferred in scenarios requiring robust handling of large parameter spaces, sparse gradients, or noisy data. Its wide adoption across different domains highlights its versatility and effectiveness.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>		Adamax is preferred in scenarios requiring robust handling of large parameter spaces, sparse gradients, or noisy data. Its wide adoption across different domains highlights its versatility and effectiveness.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>


	== Conclusion ==		== Conclusion ==
	Adamax is a robust and computationally efficient optimization algorithm that builds upon the Adam framework by replacing the second-moment estimate with the infinity norm. This modification simplifies the optimization process and enhances stability, particularly in handling sparse gradients and high-dimensional parameter spaces.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>		Adamax is a robust and computationally efficient optimization algorithm that builds upon the Adam framework by replacing the second-moment estimate with the infinity norm. This modification simplifies the optimization process and enhances stability, particularly in handling sparse gradients and high-dimensional parameter spaces.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>

Fall2024 Team13 at 18:59, 15 December 2024

2024-12-15T18:59:17Z

← Older revision		Revision as of 14:59, 15 December 2024
Line 100:		Line 100:

	==== Natural Language Processing (NLP) ====		==== Natural Language Processing (NLP) ====
	Adamax performs well in NLP tasks, such as training word embeddings, text classification, and language modeling. The ability to handle sparse gradients makes it particularly effective in models like BERT and GPT.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref> Its adaptive learning rate mechanism is advantageous for tasks where vocabulary size leads to large parameter spaces.<ref>TensorFlow Documentation. [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers AdaMax Optimizer].</ref>		Adamax performs well in NLP tasks, such as training word embeddings, text classification, and language modeling. The ability to handle sparse gradients makes it particularly effective in models like [[wikipedia:BERT_(language_model)|BERT]] and [[wikipedia:Generative_pre-trained_transformer|GPT]].<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref> Its adaptive learning rate mechanism is advantageous for tasks where vocabulary size leads to large parameter spaces.<ref>TensorFlow Documentation. [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers AdaMax Optimizer].</ref>

	==== Computer Vision ====		==== Computer Vision ====
Line 109:		Line 109:

	==== Generative Models ====		==== Generative Models ====
	Adamax has been used in training generative adversarial networks (GANs) and variational autoencoders (VAEs). The optimizer helps stabilize the training process, which can be sensitive to gradient updates.<ref>Cornell University. [https://optimization.cbe.cornell.edu/index.php?title=Adamax AdaMax - Computational Optimization Open Textbook].</ref>		Adamax has been used in training [[wikipedia:Generative_adversarial_network|generative adversarial networks]] (GANs) and [[wikipedia:Variational_autoencoder|variational autoencoders]] (VAEs). The optimizer helps stabilize the training process, which can be sensitive to gradient updates.<ref>Cornell University. [https://optimization.cbe.cornell.edu/index.php?title=Adamax AdaMax - Computational Optimization Open Textbook].</ref>

	==== Time Series Prediction ====		==== Time Series Prediction ====
	In time series forecasting tasks, Adamax efficiently handles models with recurrent neural networks (RNNs) and transformers. It has been applied to tasks like financial prediction and sensor data analysis.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>		In time series forecasting tasks, Adamax efficiently handles models with recurrent [[wikipedia:Recurrent_neural_network|neural networks]] (RNNs) and transformers. It has been applied to tasks like financial prediction and sensor data analysis.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>

	Adamax is preferred in scenarios requiring robust handling of large parameter spaces, sparse gradients, or noisy data. Its wide adoption across different domains highlights its versatility and effectiveness.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>		Adamax is preferred in scenarios requiring robust handling of large parameter spaces, sparse gradients, or noisy data. Its wide adoption across different domains highlights its versatility and effectiveness.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>

Fall2024 Team13 at 18:56, 15 December 2024

2024-12-15T18:56:53Z

← Older revision		Revision as of 14:56, 15 December 2024
Line 118:		Line 118:

	== Conclusion ==		== Conclusion ==
	Adamax is a robust and efficient ~~variant of~~ the Adam ~~optimizer that replaces~~ the ~~RMS norm~~ with the infinity norm. ~~Its ability to handle~~ sparse gradients~~, noisy updates,~~ and ~~large~~ parameter spaces ~~makes it a widely used optimization method in natural language processing~~, ~~computer vision~~, ~~reinforcement learning~~, ~~and generative modeling~~.		Adamax is a robust and computationally efficient optimization algorithm that builds upon the Adam framework by replacing the second-moment estimate with the infinity norm. This modification simplifies the optimization process and enhances stability, particularly in handling sparse gradients and high-dimensional parameter spaces.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>

	~~Future advancements may involve integrating~~ Adamax with learning rate ~~schedules~~ and ~~regularization techniques~~ to ~~further enhance~~ its ~~performance~~.		The algorithm's versatility makes it suitable for various deep learning tasks, including natural language processing, computer vision, reinforcement learning, generative models, and time series forecasting.<ref>Cornell University. [https://optimization.cbe.cornell.edu/index.php?title=Adamax AdaMax - Computational Optimization Open Textbook].</ref> Its robustness in dealing with sparse gradients, coupled with its adaptive learning rate mechanism, has contributed to its adoption in many state-of-the-art machine learning frameworks, such as TensorFlow and PyTorch.<ref>TensorFlow Documentation. [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers AdaMax Optimizer].</ref><ref>PyTorch Documentation. [https://pytorch.org/docs/stable/optim.html AdaMax Optimizer].</ref>

			Adamax’s ability to balance simplicity and performance ensures its ongoing relevance in optimizing complex models across diverse applications.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>

	== References ==		== References ==

Fall2024 Team13 at 18:55, 15 December 2024

2024-12-15T18:55:05Z

← Older revision		Revision as of 14:55, 15 December 2024
Line 97:		Line 97:
	== Applications ==		== Applications ==

	~~=== Natural Language Processing ===~~		Adamax has been widely used in various machine learning and deep learning tasks due to its robustness in handling sparse gradients and its computational efficiency.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref> Some key application areas include:
	Adamax ~~is particularly effective~~ in ~~training transformer-based models like [[wikipedia:BERT_~~(~~language_model~~)~~|BERT]] and~~ [~~[wikipedia~~:~~Generative_pre-trained_transformer|GPT]~~]. ~~Its stability with sparse gradients makes it ideal for tasks such as text classification, machine translation, and named entity recognition~~.

	=== ~~Computer Vision~~ ===		==== Natural Language Processing (NLP) ====
	~~In computer vision~~, ~~Adamax optimizes deep [~~[~~wikipedia~~:~~Convolutional_neural_network|CNN]~~]<~~nowiki~~/>s for tasks ~~like image classification and object detection~~. ~~Its smooth convergence behavior has been observed to enhance performance in models like [~~[~~wikipedia~~:~~Residual_neural_network|ResNet]~~] ~~and DenseNet~~.		Adamax performs well in NLP tasks, such as training word embeddings, text classification, and language modeling. The ability to handle sparse gradients makes it particularly effective in models like BERT and GPT.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref> Its adaptive learning rate mechanism is advantageous for tasks where vocabulary size leads to large parameter spaces.<ref>TensorFlow Documentation. [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers AdaMax Optimizer].</ref>

	=== ~~Reinforcement Learning~~ ===		==== Computer Vision ====
	Adamax has been applied in training ~~reinforcement learning agents~~, ~~particularly in environments where gradient updates are inconsistent or noisy~~, ~~such as robotic control and policy optimization~~.		Adamax has been applied in image classification and object detection tasks using deep convolutional neural networks (CNNs). For instance, its stability and adaptive learning rate have been shown to improve the training of models like ResNet and EfficientNet.<ref>He, K., Zhang, X., Ren, S., & Sun, J. (2016). [https://arxiv.org/abs/1512.03385 Deep Residual Learning for Image Recognition]. arXiv preprint arXiv:1512.03385.</ref>

	=== ~~Generative Models~~ ===		==== Reinforcement Learning ====
	~~For training generative models~~, ~~including [[wikipedia:Generative_adversarial_network|GAN]]~~<~~nowiki/~~>~~s and [~~[~~wikipedia~~:~~Variational_autoencoder|VAE]~~]<~~nowiki~~/>~~s, Adamax provides robust optimization, improving stability and output quality during adversarial training.~~		Adamax is particularly useful in reinforcement learning tasks, where it optimizes policy and value networks. Its robustness ensures stable convergence even with noisy and sparse reward signals.<ref>PyTorch Documentation. [https://pytorch.org/docs/stable/optim.html AdaMax Optimizer].</ref>

	=== ~~Time-Series Forecasting~~ ===		==== Generative Models ====
	Adamax is used in ~~financial~~ and ~~economic forecasting~~, ~~where it handles noisy gradients effectively, resulting in stable and accurate time~~-~~series predictions~~.		Adamax has been used in training generative adversarial networks (GANs) and variational autoencoders (VAEs). The optimizer helps stabilize the training process, which can be sensitive to gradient updates.<ref>Cornell University. [https://optimization.cbe.cornell.edu/index.php?title=Adamax AdaMax - Computational Optimization Open Textbook].</ref>

	=== ~~Advantages over Other Approaches~~ ===		==== Time Series Prediction ====
	*Stability: The use of the infinity norm ensures Adamax handles ~~gradient variations smoothly~~.		In time series forecasting tasks, Adamax efficiently handles models with recurrent neural networks (RNNs) and transformers. It has been applied to tasks like financial prediction and sensor data analysis.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref>

	*Sparse Gradient Handling: Adamax is ~~robust~~ in scenarios ~~with zero or near-zero~~ gradients, ~~common in NLP tasks~~.		Adamax is preferred in scenarios requiring robust handling of large parameter spaces, sparse gradients, or noisy data. Its wide adoption across different domains highlights its versatility and effectiveness.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>

	*Efficiency: ~~Adamax is computationally efficient for high-dimensional optimization problems~~.

	== Conclusion ==		== Conclusion ==

Fall2024 Team13 at 18:53, 15 December 2024

2024-12-15T18:53:06Z

← Older revision		Revision as of 14:53, 15 December 2024
Line 4:		Line 4:

	== Introduction ==		== Introduction ==
	Adamax is a variant of the Adam optimization algorithm, introduced by Kingma and Ba in 2014. It modifies the adaptive learning rate mechanism of Adam by replacing the second-moment estimate with the infinity norm of past gradients. This adjustment simplifies the optimization process and improves stability when working with sparse gradients or parameters with large variations. [~~CITE~~]		Adamax is a variant of the Adam optimization algorithm, introduced by Kingma and Ba in 2014.<ref>Kingma, D. P., & Ba, J. (2014). [https://arxiv.org/abs/1412.6980 Adam: A Method for Stochastic Optimization]. arXiv preprint arXiv:1412.6980.</ref> It modifies the adaptive learning rate mechanism of Adam by replacing the second-moment estimate with the infinity norm of past gradients. This adjustment simplifies the optimization process and improves stability when working with sparse gradients or parameters with large variations.<ref>Cornell University. [https://optimization.cbe.cornell.edu/index.php?title=Adamax AdaMax - Computational Optimization Open Textbook].</ref>

	The algorithm is designed to adaptively adjust the learning rates for each parameter based on the first-moment estimate and the infinity norm of the gradient updates. This is particularly effective in high-dimensional parameter spaces, where the algorithm avoids issues caused by over-reliance on second-moment estimates, as seen in the original Adam algorithm. [~~CITE~~]		The algorithm is designed to adaptively adjust the learning rates for each parameter based on the first-moment estimate and the infinity norm of the gradient updates. This is particularly effective in high-dimensional parameter spaces, where the algorithm avoids issues caused by over-reliance on second-moment estimates, as seen in the original Adam algorithm.<ref>TensorFlow Documentation. [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers AdaMax Optimizer].</ref>

	Adamax is well-suited for tasks involving sparse gradients and has been successfully applied in various fields, including natural language processing, computer vision, and reinforcement learning. Its robustness and computational efficiency make it a preferred choice for optimizing deep learning models. [~~CITE~~]		Adamax is well-suited for tasks involving sparse gradients and has been successfully applied in various fields, including natural language processing, computer vision, and reinforcement learning. Its robustness and computational efficiency make it a preferred choice for optimizing deep learning models.<ref>Hugging Face Documentation. [https://huggingface.co/docs/transformers Transformers Library].</ref>

	== Algorithm Discussion ==		== Algorithm Discussion ==
Line 125:		Line 125:

	== References ==		== References ==
	* Kingma, D. P., & Ba, J. (2014). [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980). arXiv preprint arXiv:1412.6980.
	* He, K., Zhang, X., Ren, S., & Sun, J. (2016). [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385). arXiv preprint arXiv:1512.03385.
	* TensorFlow Documentation. (n.d.). [AdaMax Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).
	* PyTorch Documentation. (n.d.). [AdaMax Optimizer](https://pytorch.org/docs/stable/optim.html).
	* Hugging Face Documentation. (n.d.). [Transformers Library](https://huggingface.co/docs/transformers).

Fall2024 Team13 at 18:48, 15 December 2024

2024-12-15T18:48:04Z

← Older revision		Revision as of 14:48, 15 December 2024
Line 4:		Line 4:

	== Introduction ==		== Introduction ==
	Adamax is an optimization algorithm introduced by Kingma and Ba in ~~their Adam optimizer paper (~~2014). It ~~improves upon~~ the Adam ~~algorithm~~ by replacing the second moment~~'s root mean square (RMS) norm~~ with the infinity norm ~~(<math>\ell_\infty</math>)~~. This ~~change makes Adamax more robust~~ and ~~numerically stable, especially~~ when ~~handling~~ sparse gradients~~, noisy updates,~~ or ~~optimization problems~~ with ~~significant gradient~~ variations.		Adamax is a variant of the Adam optimization algorithm, introduced by Kingma and Ba in 2014. It modifies the adaptive learning rate mechanism of Adam by replacing the second-moment estimate with the infinity norm of past gradients. This adjustment simplifies the optimization process and improves stability when working with sparse gradients or parameters with large variations. [CITE]

	~~Adamax dynamically adjusts~~ learning rates for ~~individual parameters~~, ~~making it~~ well-suited for ~~training~~ deep ~~neural networks, large-scale machine~~ learning models~~, and tasks involving high-dimensional parameter spaces~~.		The algorithm is designed to adaptively adjust the learning rates for each parameter based on the first-moment estimate and the infinity norm of the gradient updates. This is particularly effective in high-dimensional parameter spaces, where the algorithm avoids issues caused by over-reliance on second-moment estimates, as seen in the original Adam algorithm. [CITE]

			Adamax is well-suited for tasks involving sparse gradients and has been successfully applied in various fields, including natural language processing, computer vision, and reinforcement learning. Its robustness and computational efficiency make it a preferred choice for optimizing deep learning models. [CITE]

	== Algorithm Discussion ==		== Algorithm Discussion ==

Fall2024 Team13 at 18:33, 15 December 2024

2024-12-15T18:33:16Z

← Older revision		Revision as of 14:33, 15 December 2024
Line 96:		Line 96:

	=== Natural Language Processing ===		=== Natural Language Processing ===
	Adamax is particularly effective in training transformer-based models like BERT and GPT. Its stability with sparse gradients makes it ideal for tasks such as text classification, machine translation, and named entity recognition.		Adamax is particularly effective in training transformer-based models like [[wikipedia:BERT_(language_model)|BERT]] and [[wikipedia:Generative_pre-trained_transformer|GPT]]. Its stability with sparse gradients makes it ideal for tasks such as text classification, machine translation, and named entity recognition.

	=== Computer Vision ===		=== Computer Vision ===
	In computer vision, Adamax optimizes deep ~~CNNs~~ for tasks like image classification and object detection. Its smooth convergence behavior has been observed to enhance performance in models like [[wikipedia:Residual_neural_network|ResNet]] and DenseNet.		In computer vision, Adamax optimizes deep [[wikipedia:Convolutional_neural_network|CNN]]<nowiki/>s for tasks like image classification and object detection. Its smooth convergence behavior has been observed to enhance performance in models like [[wikipedia:Residual_neural_network|ResNet]] and DenseNet.

	=== Reinforcement Learning ===		=== Reinforcement Learning ===
Line 105:		Line 105:

	=== Generative Models ===		=== Generative Models ===
	For training generative models, including ~~GANs~~ and ~~VAEs~~, Adamax provides robust optimization, improving stability and output quality during adversarial training.		For training generative models, including [[wikipedia:Generative_adversarial_network|GAN]]<nowiki/>s and [[wikipedia:Variational_autoencoder|VAE]]<nowiki/>s, Adamax provides robust optimization, improving stability and output quality during adversarial training.

	=== Time-Series Forecasting ===		=== Time-Series Forecasting ===

Fall2024 Team13 at 18:20, 15 December 2024

2024-12-15T18:20:53Z

← Older revision		Revision as of 14:20, 15 December 2024
Line 99:		Line 99:

	=== Computer Vision ===		=== Computer Vision ===
	In computer vision, Adamax optimizes deep CNNs for tasks like image classification and object detection. Its smooth convergence behavior has been observed to enhance performance in models like ResNet and DenseNet.		In computer vision, Adamax optimizes deep CNNs for tasks like image classification and object detection. Its smooth convergence behavior has been observed to enhance performance in models like [[wikipedia:Residual_neural_network|ResNet]] and DenseNet.

	=== Reinforcement Learning ===		=== Reinforcement Learning ===

Fall2024 Team13 at 07:04, 15 December 2024

2024-12-15T07:04:42Z

← Older revision		Revision as of 03:04, 15 December 2024
Line 5:		Line 5:
	== Introduction ==		== Introduction ==
	Adamax is an optimization algorithm introduced by Kingma and Ba in their Adam optimizer paper (2014). It improves upon the Adam algorithm by replacing the second moment's root mean square (RMS) norm with the infinity norm (<math>\ell_\infty</math>). This change makes Adamax more robust and numerically stable, especially when handling sparse gradients, noisy updates, or optimization problems with significant gradient variations.		Adamax is an optimization algorithm introduced by Kingma and Ba in their Adam optimizer paper (2014). It improves upon the Adam algorithm by replacing the second moment's root mean square (RMS) norm with the infinity norm (<math>\ell_\infty</math>). This change makes Adamax more robust and numerically stable, especially when handling sparse gradients, noisy updates, or optimization problems with significant gradient variations.

	Historically, Adamax was introduced as part of the original Adam optimizer paper by Kingma and Ba (2014)<ref>Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.</ref>. It was presented as a variant of Adam tailored for scenarios where <math>\ell_\infty</math> norms offer computational or numerical advantages over <math>\ell_2</math> norms.

	Adamax dynamically adjusts learning rates for individual parameters, making it well-suited for training deep neural networks, large-scale machine learning models, and tasks involving high-dimensional parameter spaces.		Adamax dynamically adjusts learning rates for individual parameters, making it well-suited for training deep neural networks, large-scale machine learning models, and tasks involving high-dimensional parameter spaces.
Line 15:		Line 13:
	Given the parameters <math>\theta</math>, a learning rate <math>\alpha</math>, and decay rates <math>\beta_1</math> and <math>\beta_2</math>, Adamax follows these steps:		Given the parameters <math>\theta</math>, a learning rate <math>\alpha</math>, and decay rates <math>\beta_1</math> and <math>\beta_2</math>, Adamax follows these steps:

	=== Initialize: ===		=== Initialize ===
	* Initialize parameters <math>\theta_0</math>, the first-moment estimate <math>m_0 = 0</math>, and the exponentially weighted infinity norm <math>u_0 = 0</math>.		* Initialize parameters <math>\theta_0</math>, the first-moment estimate <math>m_0 = 0</math>, and the exponentially weighted infinity norm <math>u_0 = 0</math>.
	* Set hyperparameters:		* Set hyperparameters:
Line 23:		Line 21:
	<math>\epsilon</math>: Small constant to avoid division by zero		<math>\epsilon</math>: Small constant to avoid division by zero

	=== For each time step : ===		=== For each time step ===
	1. Compute Gradient: <math>g_t = \nabla_{\theta} J(\theta_{t-1})</math>		* Compute Gradient: <math>g_t = \nabla_{\theta} J(\theta_{t-1})</math>

	2. Update First Moment Estimate: <math>m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t</math>		* Update First Moment Estimate: <math>m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t</math>

	3. Update Infinity Norm: <math>u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)</math>		* Update Infinity Norm: <math>u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)</math>

	4. Bias Correction for the First Moment: <math>\hat{m}_t = \frac{m_t}{1 - \beta_1^t}</math>		* Bias Correction for the First Moment: <math>\hat{m}_t = \frac{m_t}{1 - \beta_1^t}</math>

	5. Parameter Update: <math>\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{u_t + \epsilon}</math>		* Parameter Update: <math>\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{u_t + \epsilon}</math>
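
The per-step updates listed above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name <code>adamax_step</code>, the list-of-floats representation, and the value of <code>eps</code> are assumptions made here, while the defaults for <code>alpha</code>, <code>beta1</code>, and <code>beta2</code> follow the settings suggested by Kingma and Ba (2014). The infinity-norm update is applied elementwise, using the absolute value of each gradient component.

```python
def adamax_step(theta, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adamax update to lists of floats.

    theta, grad, m, u are equal-length lists; t is the 1-based step count.
    Returns the updated (theta, m, u) as new lists.
    """
    new_theta, new_m, new_u = [], [], []
    for th, g, m_i, u_i in zip(theta, grad, m, u):
        m_i = beta1 * m_i + (1.0 - beta1) * g    # first-moment estimate
        u_i = max(beta2 * u_i, abs(g))           # exponentially weighted infinity norm
        m_hat = m_i / (1.0 - beta1 ** t)         # bias correction for the first moment
        th = th - alpha * m_hat / (u_i + eps)    # parameter update
        new_theta.append(th)
        new_m.append(m_i)
        new_u.append(u_i)
    return new_theta, new_m, new_u


# Example: one step on a single parameter with gradient 4.0 (alpha chosen for illustration).
theta, m, u = adamax_step([2.0], [4.0], [0.0], [0.0], t=1, alpha=0.1)
print(theta, m, u)
```

At <math>t = 1</math> the bias-corrected first moment equals the gradient itself, so the first step has magnitude close to <math>\alpha</math> regardless of the gradient's scale.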

	=== Pseudocode for Adamax ===		=== Pseudocode for Adamax ===
Line 62:		Line 60:
	*Initialization: <math>m_0 = 0, u_0 = 0, t = 0</math>		*Initialization: <math>m_0 = 0, u_0 = 0, t = 0</math>
	=== Step-by-Step Calculations ===		=== Step-by-Step Calculations ===
	==== Iteration 1: ====		==== Iteration 1 ====
	<math>t = 1</math>		<math>t = 1</math>
	*Gradient Calculation: <math>g_1 = 2x_0 = 2 \cdot 2.0 = 4.0</math>		*Gradient Calculation: <math>g_1 = 2x_0 = 2 \cdot 2.0 = 4.0</math>
Line 79:		Line 77:
	The parameter moves closer to the function's minimum at <math>x = 0</math>.		The parameter moves closer to the function's minimum at <math>x = 0</math>.
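
The iteration above can be reproduced numerically. The sketch below assumes the objective is <math>f(x) = x^2</math> (consistent with the gradient <math>2x</math> used in the calculation), a starting point <math>x_0 = 2.0</math>, and a learning rate of 0.1, which is the value that yields <math>x_1 = 1.9</math>. The decay rates 0.9 and 0.999 and the epsilon of 1e-8 are assumed typical values, and the helper name <code>run_adamax</code> is hypothetical.

```python
def run_adamax(x0, steps, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adamax on f(x) = x**2 and return the iterates [x_0, ..., x_steps]."""
    x, m, u = x0, 0.0, 0.0
    history = [x]
    for t in range(1, steps + 1):
        g = 2.0 * x                          # gradient of f(x) = x**2
        m = beta1 * m + (1.0 - beta1) * g    # first-moment estimate
        u = max(beta2 * u, abs(g))           # exponentially weighted infinity norm
        m_hat = m / (1.0 - beta1 ** t)       # bias-corrected first moment
        x = x - alpha * m_hat / (u + eps)    # parameter update
        history.append(x)
    return history


xs = run_adamax(2.0, steps=2)
print([round(x, 4) for x in xs])
```

Because the update divides by the running infinity norm, each step is bounded by roughly <math>\alpha</math> in magnitude, which is why the iterate moves from 2.0 to about 1.9 even though the gradient is 4.0.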

	==== Iteration 2: ====		==== Iteration 2 ====
	<math>t = 2</math>		<math>t = 2</math>
	*Gradient Calculation: <math>g_2 = 2x_1 = 2 \cdot 1.9 = 3.8</math>		*Gradient Calculation: <math>g_2 = 2x_1 = 2 \cdot 1.9 = 3.8</math>

Fall2024 Team13 at 07:03, 15 December 2024

2024-12-15T07:03:08Z

← Older revision		Revision as of 03:03, 15 December 2024
Line 5:		Line 5:
	== Introduction ==		== Introduction ==
	Adamax is an optimization algorithm introduced by Kingma and Ba in their Adam optimizer paper (2014). It improves upon the Adam algorithm by replacing the second moment's root mean square (RMS) norm with the infinity norm (<math>\ell_\infty</math>). This change makes Adamax more robust and numerically stable, especially when handling sparse gradients, noisy updates, or optimization problems with significant gradient variations.		Adamax is an optimization algorithm introduced by Kingma and Ba in their Adam optimizer paper (2014). It improves upon the Adam algorithm by replacing the second moment's root mean square (RMS) norm with the infinity norm (<math>\ell_\infty</math>). This change makes Adamax more robust and numerically stable, especially when handling sparse gradients, noisy updates, or optimization problems with significant gradient variations.

			Historically, Adamax was introduced as part of the original Adam optimizer paper by Kingma and Ba (2014)<ref>Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.</ref>. It was presented as a variant of Adam tailored for scenarios where <math>\ell_\infty</math> norms offer computational or numerical advantages over <math>\ell_2</math> norms.

	Adamax dynamically adjusts learning rates for individual parameters, making it well-suited for training deep neural networks, large-scale machine learning models, and tasks involving high-dimensional parameter spaces.		Adamax dynamically adjusts learning rates for individual parameters, making it well-suited for training deep neural networks, large-scale machine learning models, and tasks involving high-dimensional parameter spaces.
Line 13:		Line 15:
	Given the parameters <math>\theta</math>, a learning rate <math>\alpha</math>, and decay rates <math>\beta_1</math> and <math>\beta_2</math>, Adamax follows these steps:		Given the parameters <math>\theta</math>, a learning rate <math>\alpha</math>, and decay rates <math>\beta_1</math> and <math>\beta_2</math>, Adamax follows these steps:

	=== Initialize ===		=== Initialize: ===
	* Initialize parameters <math>\theta_0</math>, the first-moment estimate <math>m_0 = 0</math>, and the exponentially weighted infinity norm <math>u_0 = 0</math>.		* Initialize parameters <math>\theta_0</math>, the first-moment estimate <math>m_0 = 0</math>, and the exponentially weighted infinity norm <math>u_0 = 0</math>.
	* Set hyperparameters:		* Set hyperparameters:
Line 21:		Line 23:
	<math>\epsilon</math>: Small constant to avoid division by zero		<math>\epsilon</math>: Small constant to avoid division by zero

	=== For each time step ===		=== For each time step : ===
	1. Compute Gradient: <math>g_t = \nabla_{\theta} J(\theta_{t-1})</math>		1. Compute Gradient: <math>g_t = \nabla_{\theta} J(\theta_{t-1})</math>

Line 60:		Line 62:
	*Initialization: <math>m_0 = 0, u_0 = 0, t = 0</math>		*Initialization: <math>m_0 = 0, u_0 = 0, t = 0</math>
	=== Step-by-Step Calculations ===		=== Step-by-Step Calculations ===
	==== Iteration 1 ====		==== Iteration 1: ====
	<math>t = 1</math>		<math>t = 1</math>
	*Gradient Calculation: <math>g_1 = 2x_0 = 2 \cdot 2.0 = 4.0</math>		*Gradient Calculation: <math>g_1 = 2x_0 = 2 \cdot 2.0 = 4.0</math>
Line 77:		Line 79:
	The parameter moves closer to the function's minimum at <math>x = 0</math>.		The parameter moves closer to the function's minimum at <math>x = 0</math>.

	==== Iteration 2 ====		==== Iteration 2: ====
	<math>t = 2</math>		<math>t = 2</math>
	*Gradient Calculation: <math>g_2 = 2x_1 = 2 \cdot 1.9 = 3.8</math>		*Gradient Calculation: <math>g_2 = 2x_1 = 2 \cdot 1.9 = 3.8</math>