LossScaleOptimizer - Revision history

Fall2024 Wiki Team9 Alter at 22:39, 15 December 2024

2024-12-15T22:39:20Z

← Older revision		Revision as of 18:39, 15 December 2024
Line 60:		Line 60:

	The predicted value is:		The predicted value is:
	[[File:Difference in data precision for FP16 and FP32.png\|thumb\|~~383x383px~~\|'''Fig 2.'''Difference in data precision for FP16 and FP32]]		[[File:Difference in data precision for FP16 and FP32.png\|thumb\|346x346px\|'''Fig 2.'''Difference in data precision for FP16 and FP32]]
	<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>		<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>

Fall2024 Wiki Team9 Alter at 22:38, 15 December 2024

2024-12-15T22:38:10Z

← Older revision		Revision as of 18:38, 15 December 2024
Line 35:		Line 35:


	This cyclical process, guided by loss scaling, guards against the deleterious effects of limited dynamic range in FP16 computations. It has proven effective across a variety of architectures, from convolutional neural networks to transformer-based models, and is frequently combined with dynamic loss scaling techniques or layered precision strategies for enhanced robustness and adaptability<ref name=":4" />. With many deep learning frameworks offering built-in or easily configurable tools for loss scaling, incorporating this approach into the training pipeline has become both more accessible and more reliable.[[File:11.png\|thumb\|~~336x336px~~\|'''Fig 1.'''Process of using Loss Scale Optimizer]]		This cyclical process, guided by loss scaling, guards against the deleterious effects of limited dynamic range in FP16 computations. It has proven effective across a variety of architectures, from convolutional neural networks to transformer-based models, and is frequently combined with dynamic loss scaling techniques or layered precision strategies for enhanced robustness and adaptability<ref name=":4" />. With many deep learning frameworks offering built-in or easily configurable tools for loss scaling, incorporating this approach into the training pipeline has become both more accessible and more reliable.[[File:11.png\|thumb\|382x382px\|'''Fig 1.'''Process of using Loss Scale Optimizer]]


Line 60:		Line 60:

	The predicted value is:		The predicted value is:
			[[File:Difference in data precision for FP16 and FP32.png\|thumb\|383x383px\|'''Fig 2.'''Difference in data precision for FP16 and FP32]]
	<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>		<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>

Fall2024 Wiki Team9 Alter at 06:12, 15 December 2024

2024-12-15T06:12:48Z

← Older revision		Revision as of 02:12, 15 December 2024
Line 10:		Line 10:
	To address these issues, loss scaling multiplies the loss function by a predetermined or adaptively adjusted scaling factor before backpropagation. By doing so, gradients that would otherwise fall below the representable range of FP16 are “lifted” into a stable interval, thus preserving essential gradient information and supporting stable training dynamics<ref name=":3">Jia X, Thomas S, Yao Z, et al. ''Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes.'' arXiv preprint arXiv:1807.11205, 2018.</ref>. Dynamic loss scaling methods can continually optimize this scaling factor during training, adjusting to the evolving conditions within a model’s parameter space and ensuring stable training across diverse architectures and datasets<ref name=":4">Paszke A, Gross S, Massa F, et al. ''PyTorch: An Imperative Style, High-Performance Deep Learning Library.'' Advances in Neural Information Processing Systems, 2019.</ref>.		To address these issues, loss scaling multiplies the loss function by a predetermined or adaptively adjusted scaling factor before backpropagation. By doing so, gradients that would otherwise fall below the representable range of FP16 are “lifted” into a stable interval, thus preserving essential gradient information and supporting stable training dynamics<ref name=":3">Jia X, Thomas S, Yao Z, et al. ''Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes.'' arXiv preprint arXiv:1807.11205, 2018.</ref>. Dynamic loss scaling methods can continually optimize this scaling factor during training, adjusting to the evolving conditions within a model’s parameter space and ensuring stable training across diverse architectures and datasets<ref name=":4">Paszke A, Gross S, Massa F, et al. ''PyTorch: An Imperative Style, High-Performance Deep Learning Library.'' Advances in Neural Information Processing Systems, 2019.</ref>.
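The dynamic adjustment described above can be illustrated with a small sketch. The class name, parameter names, and default values here are assumptions chosen for illustration, not the API of any particular framework: the scale is reduced when an overflow is detected and increased again after a run of overflow-free steps.

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaler (hypothetical names and defaults)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_overflow:
            # Gradients contained inf/NaN: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


# Usage with a short growth interval so the behaviour is visible.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
applied_after_overflow = scaler.update(found_overflow=True)
scale_after_overflow = scaler.scale
scaler.update(found_overflow=False)
scaler.update(found_overflow=False)
scale_after_recovery = scaler.scale
```

Skipping the update on overflow keeps inf or NaN gradients out of the weights, at the cost of one lost step.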

	This concept has proven effective in tandem with layered mixed-precision strategies, wherein different layers of the network may employ varying precisions, and also when integrated with integer-based mixed precision methods aimed at enhancing convolutional neural network training. State-of-the-art deep learning frameworks, including PyTorch and TensorFlow, have implemented automated mixed precision capabilities and offer built-in functionalities for loss scaling, streamlining its adoption and refinement in practical applications<ref name=":4" />.		This concept has proven effective in tandem with layered mixed-precision strategies, wherein different layers of the network may employ varying precisions, and also when integrated with integer-based mixed precision methods aimed at enhancing convolutional neural network training<ref>Das D, Mellempudi N, Mudigere D, et al. Mixed precision training of convolutional neural networks using integer operations[J]. arXiv preprint arXiv:1802.00930, 2018.</ref>. State-of-the-art deep learning frameworks, including PyTorch and TensorFlow, have implemented automated mixed precision capabilities and offer built-in functionalities for loss scaling, streamlining its adoption and refinement in practical applications<ref name=":4" />.

	== ''Algorithm Discussion'' ==		== ''Algorithm Discussion'' ==

	The LossScales Optimizer leverages a scaling factor on the loss value to mitigate numerical instability within Mixed Precision Training workflows. By temporarily elevating the loss value prior to gradient computation, gradients calculated under half-precision (FP16) arithmetic are effectively “lifted” into a numerically stable range. Once the gradients are computed, the same scaling factor is applied in reverse—dividing the gradients back to their original magnitudes—ensuring that the weight updates faithfully reflect the intended adjustments. This process allows models to capitalize on the throughput and memory savings of half-precision computations without succumbing to underflow or overflow issues<ref name=":0" />. In practice, such strategies are often automated through frameworks or integrated toolsets, enabling more straightforward adoption and tuning.		The LossScales Optimizer leverages a scaling factor on the loss value to mitigate numerical instability within Mixed Precision Training workflows. By temporarily elevating the loss value prior to gradient computation, gradients calculated under half-precision (FP16) arithmetic are effectively “lifted” into a numerically stable range. Once the gradients are computed, the same scaling factor is applied in reverse—dividing the gradients back to their original magnitudes—ensuring that the weight updates faithfully reflect the intended adjustments. This process allows models to capitalize on the throughput and memory savings of half-precision computations without succumbing to underflow or overflow issues<ref name=":0" />. In practice, such strategies are often automated through frameworks or integrated toolsets, enabling more straightforward adoption and tuning.
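As a minimal sketch of the scale, backpropagate, unscale cycle described above, the following emulates FP16 rounding with Python's `struct` module (format `"e"`). The helper names and the gradient values are hypothetical; a single chain-rule product stands in for a full backward pass.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def backward_fp16(upstream: float, local: float, scale: float = 1.0) -> float:
    """One chain-rule product computed in emulated FP16.

    Multiplying the upstream gradient by `scale` is equivalent to scaling
    the loss. The product is rounded to FP16, then unscaled in full precision.
    """
    scaled_upstream = to_fp16(upstream * scale)
    scaled_grad = to_fp16(scaled_upstream * to_fp16(local))
    return scaled_grad / scale


# Hypothetical values: the true gradient is 1e-4 * 1e-4 = 1e-8,
# which is below the smallest positive FP16 value (about 6e-8).
upstream, local = 1e-4, 1e-4
no_scaling = backward_fp16(upstream, local)
with_scaling = backward_fp16(upstream, local, scale=1024.0)
```

Without scaling, the product rounds to zero. With a scale of 1024, the intermediate value stays representable and the unscaled result is close to 1e-8.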



	A representation of the algorithm’s procedure is as follows<ref name=":5">Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref>:		A representation of the algorithm’s procedure is as follows<ref name=":5">Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref>:

Fall2024 Wiki Team9 Alter at 04:48, 15 December 2024

2024-12-15T04:48:08Z

← Older revision		Revision as of 00:48, 15 December 2024
Line 15:		Line 15:

	The LossScales Optimizer leverages a scaling factor on the loss value to mitigate numerical instability within Mixed Precision Training workflows. By temporarily elevating the loss value prior to gradient computation, gradients calculated under half-precision (FP16) arithmetic are effectively “lifted” into a numerically stable range. Once the gradients are computed, the same scaling factor is applied in reverse—dividing the gradients back to their original magnitudes—ensuring that the weight updates faithfully reflect the intended adjustments. This process allows models to capitalize on the throughput and memory savings of half-precision computations without succumbing to underflow or overflow issues<ref name=":0" />. In practice, such strategies are often automated through frameworks or integrated toolsets, enabling more straightforward adoption and tuning.		The LossScales Optimizer leverages a scaling factor on the loss value to mitigate numerical instability within Mixed Precision Training workflows. By temporarily elevating the loss value prior to gradient computation, gradients calculated under half-precision (FP16) arithmetic are effectively “lifted” into a numerically stable range. Once the gradients are computed, the same scaling factor is applied in reverse—dividing the gradients back to their original magnitudes—ensuring that the weight updates faithfully reflect the intended adjustments. This process allows models to capitalize on the throughput and memory savings of half-precision computations without succumbing to underflow or overflow issues<ref name=":0" />. In practice, such strategies are often automated through frameworks or integrated toolsets, enabling more straightforward adoption and tuning.




	A representation of the algorithm’s procedure is as follows<ref name=":5">Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref>:		A representation of the algorithm’s procedure is as follows<ref name=":5">Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref>:




Line 128:		Line 124:
	=== '''Self-Supervised Learning''' ===		=== '''Self-Supervised Learning''' ===
	Self-supervised learning involves extracting meaningful representations from large volumes of unlabeled data. Although this paradigm fosters flexible and scalable model training, it can introduce gradient instabilities due to increased complexity and the absence of reliable supervisory signals<ref name=":7">Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref>. Implementing Loss Scale Optimizer mitigates these issues by preventing gradients from collapsing to zero and ensuring stable convergence under the limited dynamic range of half-precision computations. By maintaining numerical stability, the optimizer facilitates more efficient model pre-training, enabling models to leverage abundant unlabeled data without succumbing to precision-induced training disruptions.		Self-supervised learning involves extracting meaningful representations from large volumes of unlabeled data. Although this paradigm fosters flexible and scalable model training, it can introduce gradient instabilities due to increased complexity and the absence of reliable supervisory signals<ref name=":7">Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref>. Implementing Loss Scale Optimizer mitigates these issues by preventing gradients from collapsing to zero and ensuring stable convergence under the limited dynamic range of half-precision computations. By maintaining numerical stability, the optimizer facilitates more efficient model pre-training, enabling models to leverage abundant unlabeled data without succumbing to precision-induced training disruptions.

	~~'''Large-scale neural network training'''~~

	=== '''Large-scale neural network training''' ===		=== '''Large-scale neural network training''' ===

Fall2024 Wiki Team9 Alter at 04:46, 15 December 2024

2024-12-15T04:46:13Z

Fall2024 Wiki Team9 Alter at 04:41, 15 December 2024

2024-12-15T04:41:28Z


Fall2024 Wiki Team9 Alter at 03:47, 15 December 2024

2024-12-15T03:47:04Z


Fall2024 Wiki Team9 Alter at 01:25, 15 December 2024

2024-12-15T01:25:56Z

← Older revision		Revision as of 21:25, 14 December 2024
Line 126:		Line 126:
	{\| class="wikitable"		{\| class="wikitable"
	\|+		\|+
	\|工具		\|Tool
	\|描述		\|Description
	\|-		\|-
	\|TensorFlow		\|TensorFlow
	\|TensorFlow 使用 Loss Scale Optimizer ~~来确保梯度更新的稳定性。tf~~.keras.~~mixed_precision~~ API ~~可以自动处理混合精度训练~~		\|TensorFlow uses the Loss Scale Optimizer to ensure stability of gradient updates. tf.keras.mixed\_precision API can automatically handle mixed precision training
	\|-		\|-
	\|PyTorch 插件		\|PyTorch
	\|PyTorch ~~支持混合精度训练，并提供~~ API torch.cuda.amp ~~用于混合精度训练。其中，组件~~ GradScaler ~~实现了~~ Loss Scale Optimizer ~~的功能。~~		\|PyTorch supports mixed-precision training through the torch.cuda.amp API, whose GradScaler component implements the function of Loss Scale Optimizer.
	\|-		\|-
	\|NVIDIA Apex		\|NVIDIA Apex
	\|Apex 是 NVIDIA ~~的 PyTorch 扩展库，专门用于加速深度学习训练。它包括~~ LossScaler ~~作为处理混合精度训练的关键组件。~~		\|Apex is a PyTorch extension library from NVIDIA specifically designed to accelerate deep learning training. It includes LossScaler as a key component for handling mixed-precision training.
	\|-		\|-
	\|~~深度速度~~		\|DeepSpeed
	\|DeepSpeed 是由 Microsoft ~~开发的深度学习优化库。它支持在大规模训练中使用混合精度训练，并能够通过~~ LossScaleOptimizer ~~进一步提高训练的稳定性和性能。~~		\|DeepSpeed is a deep learning optimisation library developed by Microsoft. It supports the use of mixed-precision training in large-scale training and is able to further improve the stability and performance of training through LossScaleOptimizer.
	\|-		\|-
	\|Microsoft Azure ~~机器学习~~		\|Microsoft Azure ML
	\|在 Azure Cloud Platform ~~中，用户可以通过~~ AzureML SDK ~~实现混合精度训练。~~		\|In the Azure Cloud Platform, users can implement mixed precision training through the AzureML SDK.
	\|}		\|}

Fall2024 Wiki Team9 Alter at 00:45, 15 December 2024

2024-12-15T00:45:57Z

← Older revision		Revision as of 20:45, 14 December 2024
Line 47:		Line 47:

	<math>\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)</math>		<math>\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)</math>



	Suppose the input data <math>x = 0.1</math>, the true label <math>y = 0.2</math>, and the initial parameters are <math>w = 0.01</math> and <math>b = 0.01</math>.		Suppose the input data <math>x = 0.1</math>, the true label <math>y = 0.2</math>, and the initial parameters are <math>w = 0.01</math> and <math>b = 0.01</math>.
Line 56:		Line 54:
	<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>		<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>

			=== '''Without Loss Scale''' ===

	'''Without Loss Scale'''



	Then the loss value is:		Then the loss value is:
Line 87:		Line 81:
	If the model is trained in FP16, computing <math>w_{\text{new}}</math> suffers from gradient underflow, and the value is approximated as 0.01004.		If the model is trained in FP16, computing <math>w_{\text{new}}</math> suffers from gradient underflow, and the value is approximated as 0.01004.

			=== '''With Loss Scale''' ===

	'''With Loss Scale'''



	To avoid gradient underflow, we introduce Loss Scale, assuming that the Loss Scale factor <math>s = 1024</math> is used.		To avoid gradient underflow, we introduce Loss Scale, assuming that the Loss Scale factor <math>s = 1024</math> is used.
Line 108:		Line 98:
	\frac{\partial L_{\text{scaled}}}{\partial b} = \frac{\partial L}{\partial b} \times s = -0.3798 \times 1024 = -388.9152		\frac{\partial L_{\text{scaled}}}{\partial b} = \frac{\partial L}{\partial b} \times s = -0.3798 \times 1024 = -388.9152
	</math>		</math>
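The scaled bias gradient above can be checked numerically. This is a small sketch that emulates FP16 rounding with Python's `struct` module; the helper name `to_fp16` is an assumption for illustration, while the gradient value and the scale factor are the ones used in the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


s = 1024.0          # loss scale factor from the example
grad_b = -0.3798    # unscaled gradient with respect to b from the example

scaled = grad_b * s                 # scaled gradient in full precision
scaled_fp16 = to_fp16(scaled)       # what FP16 can actually store
recovered = scaled_fp16 / s         # unscaled again in full precision
```

FP16 stores the scaled gradient as -389.0 rather than -388.9152. Because the scale is a power of two, dividing by it afterwards gives exactly the value that rounding the unscaled gradient to FP16 would have given, so scaling adds no extra rounding error for values in the normal FP16 range.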



	Using scaled gradient values for <math>w</math> update:		Using scaled gradient values for <math>w</math> update:
Line 125:		Line 113:

	= ''Applications'' =		= ''Applications'' =
	~~'''Mixed Precision Training'''~~

			=== '''Mixed Precision Training''' ===
	When training with FP16, calculations are faster because FP16 data takes up less memory. However, this can also lead to a loss of numerical accuracy because of gradient underflow<ref>Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.</ref>. Loss Scale Optimizer can be used to keep training stable by scaling the loss function, which prevents gradients from becoming too small to represent. For example, when training a large convolutional neural network, if FP16 is used to accelerate the computation, Loss Scale will dynamically scale the loss to keep the gradient computation in a reasonable range.		When training with FP16, calculations are faster because FP16 data takes up less memory. However, this can also lead to a loss of numerical accuracy because of gradient underflow<ref>Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.</ref>. Loss Scale Optimizer can be used to keep training stable by scaling the loss function, which prevents gradients from becoming too small to represent. For example, when training a large convolutional neural network, if FP16 is used to accelerate the computation, Loss Scale will dynamically scale the loss to keep the gradient computation in a reasonable range.

	'''Self-Supervised Learning'''		=== '''Self-Supervised Learning''' ===

	Self-supervised learning methods use a large amount of unlabelled data during training, which may cause instability during gradient computation<ref>Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref>. Loss Scale Optimizer helps to adjust the scale of the loss function during training, avoiding instability caused by lack of precision, and ensuring that the network can converge smoothly.		Self-supervised learning methods use a large amount of unlabelled data during training, which may cause instability during gradient computation<ref>Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref>. Loss Scale Optimizer helps to adjust the scale of the loss function during training, avoiding instability caused by lack of precision, and ensuring that the network can converge smoothly.

	'''Large-scale neural network training'''		=== '''Large-scale neural network training''' ===

	When training large-scale neural networks (e.g., large models such as GPT, BERT, etc.), the model parameters and computation volume are very large, and the training will encounter memory and computational resource limitations<ref>Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. Frontiers in neuroscience, 2020, 14: 406.</ref>. By using Loss Scale Optimizer, we can avoid the instability of gradient computation due to precision limitation<ref>Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.</ref>.		When training large-scale neural networks (e.g., large models such as GPT, BERT, etc.), the model parameters and computation volume are very large, and the training will encounter memory and computational resource limitations<ref>Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. Frontiers in neuroscience, 2020, 14: 406.</ref>. By using Loss Scale Optimizer, we can avoid the instability of gradient computation due to precision limitation<ref>Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.</ref>.

	'''Commonly Used Tools With Loss Scale Optimizer'''		=== '''Commonly Used Tools With Loss Scale Optimizer''' ===
	{\| class="wikitable"		{\| class="wikitable"
	\|+		\|+

Fall2024 Wiki Team9 Alter at 00:42, 15 December 2024

2024-12-15T00:42:52Z

← Older revision		Revision as of 20:42, 14 December 2024
Line 4:		Line 4:

	== ''Introduction'' ==		== ''Introduction'' ==
	Loss Scale Optimizer is mainly used to deal with numerical stability problems in Mixed Precision Training (MPT) of deep learning models. Mixed Precision Training involves using both lower-precision (float16) and standard precision (float32) data types, which allows for faster training and reduced memory usage without sacrificing model accuracy<~~sup~~>[1]</~~sup~~>. However, the smaller dynamic range of FP16 may result in numerical underflow (i.e. the magnitude of a computed result is smaller than the smallest number that can be represented in the floating-point format), causing the gradients to become zero and preventing proper learning<~~sup~~>[2]</~~sup~~>.		Loss Scale Optimizer is mainly used to deal with numerical stability problems in Mixed Precision Training (MPT) of deep learning models. Mixed Precision Training involves using both lower-precision (float16) and standard precision (float32) data types, which allows for faster training and reduced memory usage without sacrificing model accuracy<ref>Micikevicius P, Narang S, Alben J, et al. Mixed precision training[J]. arXiv preprint arXiv:1710.03740, 2017.</ref>. However, the smaller dynamic range of FP16 may result in numerical underflow (i.e. the magnitude of a computed result is smaller than the smallest number that can be represented in the floating-point format), causing the gradients to become zero and preventing proper learning<ref>Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.</ref>.

	Loss Scaling works by multiplying the loss value by a scaling factor (Loss Scale Factor) after the loss is computed and before backpropagation. The purpose is to scale gradients that would be too small for FP16 into a range that FP16 can represent, thus avoiding numerical underflow<~~sup~~>[3]</~~sup~~>.		Loss Scaling works by multiplying the loss value by a scaling factor (Loss Scale Factor) after the loss is computed and before backpropagation. The purpose is to scale gradients that would be too small for FP16 into a range that FP16 can represent, thus avoiding numerical underflow<ref>Das D, Mellempudi N, Mudigere D, et al. Mixed precision training of convolutional neural networks using integer operations[J]. arXiv preprint arXiv:1802.00930, 2018.</ref>.

	== ''Algorithm Discussion'' ==		== ''Algorithm Discussion'' ==



Line 23:		Line 24:
	5. Parameter gradient divided by scaling factor.		5. Parameter gradient divided by scaling factor.

	6. Update the model parameters of float32 using the gradient of float16<~~sup~~>[4]</~~sup~~>.		6. Update the model parameters of float32 using the gradient of float16<ref>Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref>.

	[[File:11.png\|thumb\|336x336px]]		[[File:11.png\|thumb\|336x336px\|'''Fig 1.'''Process of using Loss Scale Optimizer]]
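The last step, updating float32 parameters rather than float16 ones, can be motivated with a small sketch. It emulates FP16 and FP32 rounding with Python's `struct` module; the helper names and the size of the update are assumptions for illustration, and the weight value 0.01 is taken from the worked example in the article.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


w = 0.01        # initial weight from the worked example
update = 1e-6   # hypothetical learning_rate * gradient

# Applying the update to a weight stored in FP16.
fp16_weight = to_fp16(to_fp16(w) + update)
# Applying the same update to an FP32 master copy of the weight.
fp32_master = to_fp32(to_fp32(w) + update)

update_lost_in_fp16 = fp16_weight == to_fp16(w)
update_kept_in_fp32 = fp32_master != to_fp32(w)
```

Near 0.01 the spacing between adjacent FP16 values is about 7.6e-6, so an update of 1e-6 is rounded away, while the FP32 copy retains it.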


Line 46:		Line 47:

	<math>\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)</math>		<math>\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)</math>



Line 53:		Line 55:

	<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>		<math>\hat{y} = w \cdot x + b = 0.01 \cdot 0.1 + 0.01 = 0.0101</math>



	'''Without Loss Scale'''		'''Without Loss Scale'''



Line 61:		Line 65:

	<math>L = \frac{1}{2} (\hat{y} - y)^2 = \frac{1}{2} (0.0101 - 0.2)^2 = \frac{1}{2} (-0.1899)^2 = 0.018031</math>		<math>L = \frac{1}{2} (\hat{y} - y)^2 = \frac{1}{2} (0.0101 - 0.2)^2 = \frac{1}{2} (-0.1899)^2 = 0.018031</math>



Line 81:		Line 86:

	If the model is trained in FP16, computing <math>w_{\text{new}}</math> suffers from gradient underflow, and the value is approximated as 0.01004.		If the model is trained in FP16, computing <math>w_{\text{new}}</math> suffers from gradient underflow, and the value is approximated as 0.01004.



	'''With Loss Scale'''		'''With Loss Scale'''



Line 101:		Line 108:
	\frac{\partial L_{\text{scaled}}}{\partial b} = \frac{\partial L}{\partial b} \times s = -0.3798 \times 1024 = -388.9152		\frac{\partial L_{\text{scaled}}}{\partial b} = \frac{\partial L}{\partial b} \times s = -0.3798 \times 1024 = -388.9152
	</math>		</math>



Line 119:		Line 127:
	'''Mixed Precision Training'''		'''Mixed Precision Training'''

	When training with FP16, calculations are faster because FP16 data takes up less memory. However, this can also lead to a loss of numerical accuracy because of gradient underflow<~~sup~~>[5]</~~sup~~>. Loss Scale Optimizer can be used to keep training stable by scaling the loss function, which prevents gradients from becoming too small to represent. For example, when training a large convolutional neural network, if FP16 is used to accelerate the computation, Loss Scale will dynamically scale the loss to keep the gradient computation in a reasonable range.		When training with FP16, calculations are faster because FP16 data takes up less memory. However, this can also lead to a loss of numerical accuracy because of gradient underflow<ref>Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.</ref>. Loss Scale Optimizer can be used to keep training stable by scaling the loss function, which prevents gradients from becoming too small to represent. For example, when training a large convolutional neural network, if FP16 is used to accelerate the computation, Loss Scale will dynamically scale the loss to keep the gradient computation in a reasonable range.

	'''Self-Supervised Learning'''		'''Self-Supervised Learning'''

	Self-supervised learning methods use a large amount of unlabelled data during training, which may cause instability during gradient computation<~~sup~~>[6]</~~sup~~>. Loss Scale Optimizer helps to adjust the scale of the loss function during training, avoiding instability caused by lack of precision, and ensuring that the network can converge smoothly.		Self-supervised learning methods use a large amount of unlabelled data during training, which may cause instability during gradient computation<ref>Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref>. Loss Scale Optimizer helps to adjust the scale of the loss function during training, avoiding instability caused by lack of precision, and ensuring that the network can converge smoothly.

	'''Large-scale neural network training'''		'''Large-scale neural network training'''

	When training large-scale neural networks (e.g., large models such as GPT, BERT, etc.), the model parameters and computation volume are very large, and the training will encounter memory and computational resource limitations<~~sup~~>[7]</~~sup~~>. By using Loss Scale Optimizer, we can avoid the instability of gradient computation due to precision limitation<~~sup~~>[8]</~~sup~~>.		When training large-scale neural networks (e.g., large models such as GPT, BERT, etc.), the model parameters and computation volume are very large, and the training will encounter memory and computational resource limitations<ref>Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. Frontiers in neuroscience, 2020, 14: 406.</ref>. By using Loss Scale Optimizer, we can avoid the instability of gradient computation due to precision limitation<ref>Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.</ref>.

	'''Commonly Used Tools With Loss Scale Optimizer'''		'''Commonly Used Tools With Loss Scale Optimizer'''
Line 155:		Line 163:

	= ''References'' =		= ''References'' =
	~~1. Micikevicius P, Narang S, Alben J, et al. Mixed precision training[J]. arXiv preprint arXiv:1710.03740, 2017.~~		<references />

	2. Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.

	~~3. Das D, Mellempudi N, Mudigere D, et al. Mixed precision training of convolutional neural networks using integer operations[J]. arXiv preprint arXiv:1802.00930, 2018.~~

	~~4. Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29].~~ <~~nowiki>https://www.hiascend.com/document/detail/zh~~/~~Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki~~>.

	~~5. Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.~~

	6. Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.

	~~7. Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. Frontiers in neuroscience, 2020, 14: 406.~~

	8. Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.

← Older revision		Revision as of 00:46, 15 December 2024
Line 4:		Line 4:

	== ''Introduction'' ==		== ''Introduction'' ==
	<ref name=":0">Micikevicius P, Narang S, Alben J, et al. Mixed precision training[J]. arXiv preprint arXiv:1710.03740, 2017.</ref><ref name=":1">Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.</ref><ref>Das D, Mellempudi N, Mudigere D, et al. Mixed precision training of convolutional neural networks using integer operations[J]. arXiv preprint arXiv:1802.00930, 2018.</ref><ref name=":2">NVIDIA Developer Blog. “Mixed Precision Training.” Available: <nowiki>https://developer.nvidia.com/mixed-precision-training</nowiki></ref><ref name=":3">Jia X, Thomas S, Yao Z, et al. ''Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes.'' arXiv preprint arXiv:1807.11205, 2018.</ref><ref name=":4">Paszke A, Gross S, Massa F, et al. ''PyTorch: An Imperative Style, High-Performance Deep Learning Library.'' Advances in Neural Information Processing Systems, 2019.</ref><ref name=":5">Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref><ref name=":6">Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.</ref><ref name=":7">Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref><ref name=":8">Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. 
Frontiers in neuroscience, 2020, 14: 406.</ref>A Loss Scale Optimizer is a technique designed to maintain numerical stability in Mixed Precision Training (MPT) environments by systematically adjusting the magnitude of the loss value to ensure that gradients remain within representable ranges, thereby preventing underflow or overflow in half-precision (FP16) computations<ref name=":0" />. Mixed Precision Training involves using both single-precision (FP32) and half-precision (FP16) floating-point numbers within the same model, a practice that can significantly reduce computation time and memory usage without substantially degrading model accuracy<ref name=":2" />. This balanced approach enables large-scale and high-complexity neural networks to be trained more efficiently, facilitating rapid experimentation and deployment.		A Loss Scale Optimizer is a technique designed to maintain numerical stability in Mixed Precision Training (MPT) environments by systematically adjusting the magnitude of the loss value to ensure that gradients remain within representable ranges, thereby preventing underflow or overflow in half-precision (FP16) computations<ref name=":0">Micikevicius P, Narang S, Alben J, et al. Mixed precision training[J]. arXiv preprint arXiv:1710.03740, 2017.</ref>. Mixed Precision Training involves using both single-precision (FP32) and half-precision (FP16) floating-point numbers within the same model, a practice that can significantly reduce computation time and memory usage without substantially degrading model accuracy<ref name=":2">NVIDIA Developer Blog. “Mixed Precision Training.” Available: <nowiki>https://developer.nvidia.com/mixed-precision-training</nowiki></ref>. This balanced approach enables large-scale and high-complexity neural networks to be trained more efficiently, facilitating rapid experimentation and deployment.

	However, the limited dynamic range of FP16 arithmetic introduces numerical challenges. When gradients or intermediate values become exceedingly small or large, they may underflow, collapsing to zero and halting effective learning, or overflow, producing Inf or NaN (Not a Number) values that impede convergence<ref name=":1" />. These problems are especially pronounced in larger or more sensitive models, where unstable gradients can undermine training progress and final model quality.		However, the limited dynamic range of FP16 arithmetic introduces numerical challenges. When gradients or intermediate values become exceedingly small or large, they may underflow, collapsing to zero and halting effective learning, or overflow, producing Inf or NaN (Not a Number) values that impede convergence<ref name=":1">Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.</ref>. These problems are especially pronounced in larger or more sensitive models, where unstable gradients can undermine training progress and final model quality.
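The underflow and overflow behaviour described above can be reproduced in a few lines. This is only an illustrative sketch: it uses NumPy's <code>float16</code> type as a stand-in for hardware FP16, and the variable names and sample values are chosen for the example rather than taken from the article.

```python
import numpy as np

# FP16 limits: the largest finite value is 65504 and the smallest
# positive (subnormal) value is roughly 5.96e-8.
with np.errstate(over="ignore", invalid="ignore"):
    tiny_grad = np.float16(1e-8)        # too small: rounds to zero (underflow)
    huge_value = np.float16(70000.0)    # too large: becomes inf (overflow)
    poisoned = huge_value - huge_value  # inf - inf yields NaN

print(tiny_grad, huge_value, poisoned)
```

A gradient that has rounded to zero no longer updates its weight, while a single Inf can turn later arithmetic into NaN.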

	To address these issues, loss scaling multiplies the loss function by a predetermined or adaptively adjusted scaling factor before backpropagation. By doing so, gradients that would otherwise fall below the representable range of FP16 are “lifted” into a stable interval, thus preserving essential gradient information and supporting stable training dynamics<ref name=":3" />. Dynamic loss scaling methods can continually optimize this scaling factor during training, adjusting to the evolving conditions within a model’s parameter space and ensuring stable training across diverse architectures and datasets<ref name=":4" />.		To address these issues, loss scaling multiplies the loss function by a predetermined or adaptively adjusted scaling factor before backpropagation. By doing so, gradients that would otherwise fall below the representable range of FP16 are “lifted” into a stable interval, thus preserving essential gradient information and supporting stable training dynamics<ref name=":3">Jia X, Thomas S, Yao Z, et al. ''Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes.'' arXiv preprint arXiv:1807.11205, 2018.</ref>. Dynamic loss scaling methods can continually optimize this scaling factor during training, adjusting to the evolving conditions within a model’s parameter space and ensuring stable training across diverse architectures and datasets<ref name=":4">Paszke A, Gross S, Massa F, et al. ''PyTorch: An Imperative Style, High-Performance Deep Learning Library.'' Advances in Neural Information Processing Systems, 2019.</ref>.
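The "lifting" effect of a fixed scaling factor can be sketched as follows. Because backpropagation is linear in the loss, multiplying the loss by a factor multiplies every gradient by the same factor, so the sketch scales a single gradient value directly. The helper name <code>fp16_roundtrip</code>, the gradient value 1e-8 and the scale 65536 are assumptions made for this example, not values from the cited sources.

```python
import numpy as np

def fp16_roundtrip(grad_fp32, loss_scale=1.0):
    """Scale a gradient, store it in FP16, then unscale it in FP32."""
    scaled = np.float32(grad_fp32) * np.float32(loss_scale)
    stored = np.float16(scaled)  # the lossy half-precision step
    return np.float32(stored) / np.float32(loss_scale)

true_grad = 1e-8  # below the FP16 representable range

without_scaling = fp16_roundtrip(true_grad)                 # underflows to zero
with_scaling = fp16_roundtrip(true_grad, loss_scale=65536.0)  # survives FP16

print(without_scaling, with_scaling)
```

Without scaling the gradient is lost entirely; with scaling it is recovered to within FP16 rounding error after the division in FP32.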

	This concept has proven effective in tandem with layered mixed-precision strategies, wherein different layers of the network may employ varying precisions, and also when integrated with integer-based mixed precision methods aimed at enhancing convolutional neural network training. State-of-the-art deep learning frameworks, including PyTorch and TensorFlow, have implemented automated mixed precision capabilities and offer built-in functionalities for loss scaling, streamlining its adoption and refinement in practical applications<ref name=":4" />.		This concept has proven effective in tandem with layered mixed-precision strategies, wherein different layers of the network may employ varying precisions, and also when integrated with integer-based mixed precision methods aimed at enhancing convolutional neural network training. State-of-the-art deep learning frameworks, including PyTorch and TensorFlow, have implemented automated mixed precision capabilities and offer built-in functionalities for loss scaling, streamlining its adoption and refinement in practical applications<ref name=":4" />.
Line 19:		Line 19:


	A representation of the algorithm’s procedure is as follows<ref name=":5" />:		A representation of the algorithm’s procedure is as follows<ref name=":5">Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. <nowiki>https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html</nowiki>.</ref>:


Line 124:		Line 124:

	=== '''Mixed Precision Training''' ===		=== '''Mixed Precision Training''' ===
	Mixed Precision Training leverages half-precision floating-point formats (FP16), significantly reducing memory usage and accelerating computational throughput. However, directly employing FP16 arithmetic without additional measures can lead to numerical instability, such as gradient underflow, especially in large and complex networks<ref name=":8" />. By applying a Loss Scale Optimizer, the training process incorporates a dynamic scaling factor to the loss function. This step ensures that gradients, which might otherwise vanish due to limited precision, are maintained within a stable and representable range. As a result, practitioners can achieve faster training and reduced resource consumption without sacrificing model quality or stability<ref name=":2" />. For example, when training large convolutional neural networks, scaling the loss value helps keep the gradient computations numerically robust, even under high-complexity workloads that push computational boundaries.		Mixed Precision Training leverages half-precision floating-point formats (FP16), significantly reducing memory usage and accelerating computational throughput. However, directly employing FP16 arithmetic without additional measures can lead to numerical instability, such as gradient underflow, especially in large and complex networks<ref name=":8">Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. Frontiers in neuroscience, 2020, 14: 406.</ref>. By applying a Loss Scale Optimizer, the training process incorporates a dynamic scaling factor to the loss function. This step ensures that gradients, which might otherwise vanish due to limited precision, are maintained within a stable and representable range. As a result, practitioners can achieve faster training and reduced resource consumption without sacrificing model quality or stability<ref name=":2" />. For example, when training large convolutional neural networks, scaling the loss value helps keep the gradient computations numerically robust, even under high-complexity workloads that push computational boundaries.

	=== '''Self-Supervised Learning''' ===		=== '''Self-Supervised Learning''' ===
	Self-supervised learning involves extracting meaningful representations from large volumes of unlabeled data. Although this paradigm fosters flexible and scalable model training, it can introduce gradient instabilities due to increased complexity and the absence of reliable supervisory signals<ref name=":7" />. Implementing a Loss Scale Optimizer mitigates these issues by preventing gradients from collapsing to zero and ensuring stable convergence under the limited dynamic range of half-precision computations. By maintaining numerical stability, the optimizer facilitates more efficient model pre-training, enabling models to leverage abundant unlabeled data without succumbing to precision-induced training disruptions.		Self-supervised learning involves extracting meaningful representations from large volumes of unlabeled data. Although this paradigm fosters flexible and scalable model training, it can introduce gradient instabilities due to increased complexity and the absence of reliable supervisory signals<ref name=":7">Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.</ref>. Implementing a Loss Scale Optimizer mitigates these issues by preventing gradients from collapsing to zero and ensuring stable convergence under the limited dynamic range of half-precision computations. By maintaining numerical stability, the optimizer facilitates more efficient model pre-training, enabling models to leverage abundant unlabeled data without succumbing to precision-induced training disruptions.

	=== '''Large-scale neural network training''' ===		=== '''Large-scale neural network training''' ===
	Scaling up models to the magnitude of architectures like GPT or BERT expands the frontier of what neural networks can achieve but amplifies the challenges associated with memory constraints and computational overhead<ref name=":6" />. Under these circumstances, mixed precision strategies, combined with Loss Scale Optimizers, help alleviate the instability of gradient computations that emerges from working within narrower floating-point ranges<ref name=":6" />. By continuously adjusting the scaling factor, the optimizer ensures that the gradients retain their representational integrity, even as the model’s size and complexity increase. This approach supports the efficient and stable training of large-scale networks, broadening the scope of advanced deep learning applications and models designed to handle vast and intricate datasets.		Scaling up models to the magnitude of architectures like GPT or BERT expands the frontier of what neural networks can achieve but amplifies the challenges associated with memory constraints and computational overhead<ref name=":6">Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.</ref>. Under these circumstances, mixed precision strategies, combined with Loss Scale Optimizers, help alleviate the instability of gradient computations that emerges from working within narrower floating-point ranges<ref name=":6" />. By continuously adjusting the scaling factor, the optimizer ensures that the gradients retain their representational integrity, even as the model’s size and complexity increase. This approach supports the efficient and stable training of large-scale networks, broadening the scope of advanced deep learning applications and models designed to handle vast and intricate datasets.
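One common way to "continuously adjust" the scaling factor is to shrink it whenever a non-finite gradient appears and to grow it again after a run of clean steps. The sketch below shows that update rule in plain Python. The class name <code>DynamicLossScale</code>, its parameter names and the default values are hypothetical choices for illustration; they are not the API of PyTorch, TensorFlow or any other framework.

```python
import math

class DynamicLossScale:
    """Hypothetical helper that tracks a loss scale across training steps."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, grads):
        """Adjust the scale; return True if the weight update should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow detected: reduce the scale and skip this step.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long run of finite gradients: try a larger scale.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True

scaler = DynamicLossScale(init_scale=1024.0, growth_interval=2)
for grads in ([0.1, 0.2], [0.3, 0.1], [float("inf"), 0.2]):
    applied = scaler.update(grads)
    print(applied, scaler.scale)
```

Skipping the step on overflow keeps Inf or NaN values out of the weights, at the cost of occasionally discarding a batch.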

	=== '''Commonly Used Tools With Loss Scale Optimizer''' ===		=== '''Commonly Used Tools With Loss Scale Optimizer''' ===