Authors: Han Zeng (hz665), Tianyi Zhou (tz427), Bingzheng Wang (bw537), Regan Zhou (zz755), Emma Burford (ebb92) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

Introduction

The Loss Scale Optimizer is mainly used to address numerical stability problems in Mixed Precision Training (MPT) of deep learning models. Mixed Precision Training uses both lower-precision (float16, FP16) and standard-precision (float32, FP32) data types, which allows for faster training and reduced memory usage without sacrificing model accuracy^[1]. However, the smaller dynamic range of FP16 may result in numerical underflow (i.e. the magnitude of a computed result is smaller than the smallest number that the floating-point format can represent), causing the gradients to become zero and preventing proper learning^[2].

Loss Scaling works by multiplying the loss value by a scaling factor (Loss Scale Factor) when calculating the loss function value and backpropagation. The purpose is to scale gradients that may be too small in FP16 format to a range that FP16 can represent, thus avoiding numerical underflow^[3].
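The underflow described above can be reproduced in a few lines of NumPy. This is a minimal sketch: the gradient value `1e-8` and the scaling factor `1024` are assumptions chosen for illustration, not values from a real model.

```python
import numpy as np

# Smallest positive values representable in float16.
smallest_normal = float(np.finfo(np.float16).tiny)   # 2**-14, about 6.10e-05
smallest_subnormal = 2.0 ** -24                       # about 5.96e-08

# A hypothetical gradient that is fine in float32 but too small for float16.
grad_fp32 = np.float32(1e-8)
grad_fp16 = np.float16(grad_fp32)                     # underflows to zero

# Scaling by an assumed loss-scale factor before the cast keeps it nonzero.
scale = np.float32(1024.0)
scaled_fp16 = np.float16(grad_fp32 * scale)

# Unscaling in float32 recovers the original gradient up to rounding.
recovered = np.float32(scaled_fp16) / scale

print(grad_fp16, scaled_fp16, recovered)
```

The scaled value is still stored with limited precision, so the recovered gradient matches the original only approximately.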

Algorithm Discussion

The core idea is to multiply the loss by a scaling factor before computing the gradients, thereby increasing the gradient values to avoid numerical underflow. Once the gradients are computed, they are divided by the same scaling factor before updating the model weights. This ensures that the gradient values are within a stable numerical range while retaining the original scale of updates.
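The reason the update keeps its original scale is that the gradient is linear in the loss. A short derivation, assuming exact arithmetic and writing $s$ for the scaling factor, $\eta$ for the learning rate and $\theta$ for any model parameter (symbols introduced here for illustration):

$L_{\text{scaled}}=s\cdot L$

${\frac {\partial L_{\text{scaled}}}{\partial \theta }}=s\cdot {\frac {\partial L}{\partial \theta }}$

$\theta -\eta \cdot {\frac {1}{s}}\cdot {\frac {\partial L_{\text{scaled}}}{\partial \theta }}=\theta -\eta \cdot {\frac {\partial L}{\partial \theta }}$

In floating-point arithmetic the two sides differ only by rounding, while the scaled gradient is far less likely to fall below the range that FP16 can represent.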

1. Copy the float32 parameters and convert the copy to float16.

2. Forward propagation (using the float16 model parameters).

3. Multiply the loss by the scaling factor.

4. Backpropagation (using the float16 model parameters, producing scaled float16 parameter gradients).

5. Divide the parameter gradients by the scaling factor.

6. Update the float32 model parameters using the unscaled float16 gradients^[4].

**Fig 1.** Process of using Loss Scale Optimizer
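The six steps can be sketched in NumPy for a simple linear model ${\hat {y}}=w\cdot x+b$ with an MSE loss. This is an illustrative sketch rather than a framework implementation: the function name `mixed_precision_step`, the default learning rate and the default scale are assumptions, and the loss scaling in step 3 is applied to the upstream gradient of the loss, which is equivalent to scaling the loss itself.

```python
import numpy as np

def mixed_precision_step(w32, b32, x, y, lr=1e-3, scale=1024.0):
    """One gradient-descent step for y_hat = w * x + b (hypothetical helper)."""
    # 1. Cast the float32 master parameters (and data) to float16.
    w16, b16 = np.float16(w32), np.float16(b32)
    x16, y16 = x.astype(np.float16), y.astype(np.float16)

    # 2. Forward pass in float16.
    err16 = w16 * x16 + b16 - y16

    # 3. Scale the loss. For MSE this is the same as scaling the upstream
    #    gradient dL/dy_hat = 2 * err / N by the scaling factor.
    n = np.float16(x16.size)
    upstream16 = np.float16(scale) * (np.float16(2.0) * err16 / n)

    # 4. Backward pass in float16 yields scaled gradients.
    gw16 = np.sum(upstream16 * x16, dtype=np.float16)
    gb16 = np.sum(upstream16, dtype=np.float16)

    # 5. Unscale the gradients in float32.
    gw32 = np.float32(gw16) / np.float32(scale)
    gb32 = np.float32(gb16) / np.float32(scale)

    # 6. Update the float32 master parameters.
    return np.float32(w32 - lr * gw32), np.float32(b32 - lr * gb32)

w_new, b_new = mixed_precision_step(
    np.float32(0.01), np.float32(0.01), np.array([0.1]), np.array([0.2])
)
print(w_new, b_new)
```

Keeping the master parameters in float32 means that small updates are not lost when they are added to much larger weights.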

Numerical Examples

For a traditional linear regression task, set up the model as:

${\hat {y}}=w\cdot x+b$

$w$ is the weight, $x$ is the input data, $b$ is the bias, and ${\hat {y}}$ is the predicted value.

The objective is to minimise the loss function, i.e. the MSE:

$L(w,b)={\frac {1}{N}}\sum _{i=1}^{N}({\hat {y}}_{i}-y_{i})^{2}$

The gradient can be calculated as

${\frac {\partial L}{\partial w}}={\frac {2}{N}}\sum _{i=1}^{N}({\hat {y}}_{i}-y_{i})\cdot x_{i}$

${\frac {\partial L}{\partial b}}={\frac {2}{N}}\sum _{i=1}^{N}({\hat {y}}_{i}-y_{i})$
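The gradient formulas above can be checked numerically. The sketch below compares them with central finite differences; the data values and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def mse_loss(w, b, x, y):
    """Mean squared error of the linear model y_hat = w * x + b."""
    return float(np.mean((w * x + b - y) ** 2))

def mse_grads(w, b, x, y):
    """Analytic gradients (dL/dw, dL/db) of the MSE loss."""
    err = w * x + b - y
    return float(2.0 * np.mean(err * x)), float(2.0 * np.mean(err))

# Illustrative data (assumed, not taken from the text).
x = np.array([0.1, 0.4, 0.7])
y = np.array([0.2, 0.5, 0.9])
w, b, h = 0.01, 0.01, 1e-6

grad_w, grad_b = mse_grads(w, b, x, y)

# Central finite differences as an independent check of the formulas.
num_w = (mse_loss(w + h, b, x, y) - mse_loss(w - h, b, x, y)) / (2 * h)
num_b = (mse_loss(w, b + h, x, y) - mse_loss(w, b - h, x, y)) / (2 * h)

print(grad_w, num_w)
print(grad_b, num_b)
```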

Suppose the input data $x=0.1$ , the true label $y=0.2$ , and the initial parameters are $w=0.01$ and $b=0.01$ .

The predicted value is:

${\hat {y}}=w\cdot x+b=0.01\cdot 0.1+0.01=0.011$

Without Loss Scale

With a single sample ( $N=1$ ), the loss value is:

$L=({\hat {y}}-y)^{2}=(0.011-0.2)^{2}=(-0.189)^{2}=0.035721$

The gradient can be calculated as:

${\frac {\partial L}{\partial w}}=2\cdot (0.011-0.2)\cdot 0.1=2\cdot (-0.189)\cdot 0.1=-0.0378$

${\frac {\partial L}{\partial b}}=2\cdot (0.011-0.2)=2\cdot (-0.189)=-0.378$

Use gradient descent to update the parameters $w$ and $b$ , assuming a learning rate of $\eta =1\times 10^{-3}$ .

$w$ update:

$w_{\text{new}}=w-\eta \cdot {\frac {\partial L}{\partial w}}=0.01-1\times 10^{-3}\cdot (-0.0378)=0.01+0.0000378=0.0100378$

$b$ update:

$b_{\text{new}}=b-\eta \cdot {\frac {\partial L}{\partial b}}=0.01-1\times 10^{-3}\cdot (-0.378)=0.01+0.000378=0.010378$

If the model is trained in FP16, precision is lost at this step. The update term $\eta \cdot {\frac {\partial L}{\partial w}}=-3.78\times 10^{-5}$ is smaller in magnitude than the smallest normal FP16 value (about $6.1\times 10^{-5}$ ), and $w_{\text{new}}$ is rounded to approximately 0.01004. Gradients a few orders of magnitude smaller would underflow to zero.

With Loss Scale

To avoid gradient underflow, we introduce Loss Scale, assuming that the Loss Scale factor $s=1024$ is used.

The loss value is:

$L_{\text{scaled}}=L\times s=0.035721\times 1024=36.578304$

The gradient can be calculated as:

${\frac {\partial L_{\text{scaled}}}{\partial w}}={\frac {\partial L}{\partial w}}\times s=-0.0378\times 1024=-38.7072$

${\frac {\partial L_{\text{scaled}}}{\partial b}}={\frac {\partial L}{\partial b}}\times s=-0.378\times 1024=-387.072$

Using the scaled gradient, divided by $s$ , for the $w$ update:

$w_{\text{new}}=w-\eta \cdot {\frac {1}{s}}\cdot {\frac {\partial L_{\text{scaled}}}{\partial w}}=0.01-1\times 10^{-3}\cdot {\frac {-38.7072}{1024}}=0.01+0.0000378=0.0100378$

Using the scaled gradient, divided by $s$ , for the $b$ update:

$b_{\text{new}}=b-\eta \cdot {\frac {1}{s}}\cdot {\frac {\partial L_{\text{scaled}}}{\partial b}}=0.01-1\times 10^{-3}\cdot {\frac {-387.072}{1024}}=0.01+0.000378=0.010378$

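If the model is trained in FP16, the scaled gradients lie well inside the normal FP16 range (roughly $6.1\times 10^{-5}$ to $65504$ in magnitude), so they can be stored without underflow.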
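The example can be replayed in NumPy to compare a pure FP16 update with a loss-scaled update applied to a float32 weight. This is a sketch under the inputs stated above; the variable names are illustrative, and the float64 result is used only as a reference.

```python
import numpy as np

x, y, w, b, lr, s = 0.1, 0.2, 0.01, 0.01, 1e-3, 1024.0

err = w * x + b - y                  # y_hat - y, in float64
grad_w = 2 * err * x                 # dL/dw for a single sample

# Pure FP16: the update term and the new weight are both rounded to float16.
update_fp16 = np.float16(lr * grad_w)
w_new_fp16 = np.float16(np.float16(w) - update_fp16)

# Loss scaling: the scaled gradient is stored in float16, then unscaled and
# applied to a float32 master weight.
scaled_grad_fp16 = np.float16(grad_w * s)
unscaled_grad = np.float32(scaled_grad_fp16) / np.float32(s)
w_new_fp32 = np.float32(w) - np.float32(lr) * unscaled_grad

reference = w - lr * grad_w          # float64 reference value

print(float(w_new_fp16), float(w_new_fp32), reference)
```

In this small example the FP16 result is still close to the reference; the benefit of scaling grows as gradients approach the bottom of the FP16 range.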

Applications

Mixed Precision Training

When training with FP16, calculations are faster and FP16 data takes up less memory. However, this can also lead to a loss of numerical accuracy because of gradient underflow^[5]. The Loss Scale Optimizer can be used to keep training stable by scaling the loss function, which prevents gradients from becoming too small to represent. For example, when training a large convolutional neural network with FP16 to accelerate the computation, the Loss Scale Optimizer dynamically scales the loss to keep the gradients in a representable range.
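One common way to scale the loss dynamically is to shrink the scale when an overflow (inf or NaN) appears in the gradients and to grow it again after a run of overflow-free steps. The sketch below illustrates that policy; the class name `DynamicLossScaler` and all default values are assumptions for illustration and do not reproduce any particular library.

```python
import numpy as np

class DynamicLossScaler:
    """Hypothetical helper: back off on overflow, grow after a run of good steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=3):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def unscale(self, scaled_grads):
        """Return float32 gradients, or None if the step should be skipped."""
        grads = [np.asarray(g, dtype=np.float32) / np.float32(self.scale)
                 for g in scaled_grads]
        found_inf = not all(np.all(np.isfinite(g)) for g in grads)
        self._update(found_inf)
        return None if found_inf else grads

    def _update(self, found_inf):
        if found_inf:
            # Overflow: reduce the scale and restart the counter.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                # Enough stable steps: try a larger scale again.
                self.scale *= self.growth_factor
                self.good_steps = 0
```

A training loop would multiply the loss by `scaler.scale` before backpropagation, call `unscale` on the resulting gradients, and skip the weight update whenever it returns `None`.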

Self-Supervised Learning

Self-supervised learning methods use a large amount of unlabelled data during training, which may cause instability during gradient computation^[6]. Loss Scale Optimizer helps to adjust the scale of the loss function during training, avoiding instability caused by lack of precision, and ensuring that the network can converge smoothly.

Large-scale neural network training

When training large-scale neural networks (e.g., large models such as GPT and BERT), the number of model parameters and the amount of computation are very large, so training runs into memory and computational resource limitations^[7]. Mixed precision eases these limitations, and using the Loss Scale Optimizer avoids the instability of gradient computation caused by the reduced precision^[8].

Commonly Used Tools With Loss Scale Optimizer


| Tool | Description |
|---|---|
| TensorFlow | TensorFlow uses the Loss Scale Optimizer to ensure the stability of gradient updates. The tf.keras.mixed\_precision API can handle mixed precision training automatically. |
| PyTorch | PyTorch supports mixed-precision training through the torch.cuda.amp API. Its GradScaler component implements the function of the Loss Scale Optimizer. |
| NVIDIA Apex | Apex is a PyTorch extension library from NVIDIA designed to accelerate deep learning training. It includes LossScaler as a key component for handling mixed-precision training. |
| DeepSpeed | DeepSpeed is a deep learning optimisation library developed by Microsoft. It supports mixed-precision training at large scale and uses loss scaling to further improve the stability and performance of training. |
| Microsoft Azure ML | On the Azure cloud platform, users can implement mixed precision training through the AzureML SDK. |

Conclusion

The Loss Scale Optimizer has a wide range of applications in machine learning, especially for computationally intensive models. It makes mixed-precision computation practical, which requires less memory and trains faster than full-precision training.

References

[1] Micikevicius P, Narang S, Alben J, et al. Mixed precision training[J]. arXiv preprint arXiv:1710.03740, 2017.

[2] Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.

[3] Das D, Mellempudi N, Mudigere D, et al. Mixed precision training of convolutional neural networks using integer operations[J]. arXiv preprint arXiv:1802.00930, 2018.

[4] Principle of mixed precision and calculation process (AMP)[EB/OL]. [2024-11-29]. https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0077.html.

[5] Mellempudi N, Srinivasan S, Das D, et al. Mixed precision training with 8-bit floating point[J]. arXiv preprint arXiv:1905.12334, 2019.

[6] Liu Q, Millis B A, Asad Z, et al. Integrate memory efficiency methods for self-supervised learning on pathological image analysis[C]//Medical Imaging 2022: Image Processing. SPIE, 2022, 12032: 695-701.

[7] Nandakumar S R, Le Gallo M, Piveteau C, et al. Mixed-precision deep learning based on computational memory[J]. Frontiers in neuroscience, 2020, 14: 406.

[8] Li H, Wang Y, Hong Y, et al. Layered mixed-precision training: a new training method for large-scale AI models[J]. Journal of King Saud University-Computer and Information Sciences, 2023, 35(8): 101656.
