Stochastic gradient descent: Difference between revisions

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Jump to navigation Jump to search
No edit summary
 
(45 intermediate revisions by 4 users not shown)
Line 1: Line 1:
'''Stochastic gradient descent''' (abbreviated as '''SGD''') is an iterative method often used for [https://en.wikipedia.org/wiki/Machine_learning machine learning], optimizing the [https://en.wikipedia.org/wiki/Gradient_descent gradient descent] during each search once a random weight vector is picked. The gradient descent is a strategy that searches through a large or infinite hypothesis space whenever 1) there are hypotheses continuously being parameterized and 2) the errors are differentiable based on the parameters. The problem with gradient descent is that [https://en.wikipedia.org/wiki/Convergence_(logic) converging] to a [https://en.wikipedia.org/wiki/Maxima_and_minima local minimum] takes extensive time and determining a global minimum is not guaranteed (McGrawHill, 92). The gradient descent picks any random weight vector and continuously updates it incrementally when an error calculation is completed to improve convergence (Needell, Ward, Nati Srebro, 14). The method seeks to determine the steepest descent and it reduces the number of [https://en.wikipedia.org/wiki/Iteration iterations] and the time taken to search large quantities of data points. Over the recent years, the data sizes have increased immensely such that current processing capabilities are not enough (Bottou, 177). Stochastic gradient descent is being used in [https://en.wikipedia.org/wiki/Neural_network neural networks] and decreases machine computation time while increasing complexity and performance for large-scale problems.
Authors: Jonathon Price, Alfred Wong, Tiancheng Yuan, Joshua Mathews, Taiwo Olorunniwo (SysEn 5800 Fall 2020)


Authors: Jonathon Price, Alfred Wong, Tiancheng Yuan, Joshua Matthews, Taiwo Olorunniwo (SYSEN 5800 Fall 2020)<br>
== Introduction ==
Steward: Fengqi You
'''Stochastic gradient descent''' (abbreviated as '''SGD''') is an iterative method often used for [https://en.wikipedia.org/wiki/Machine_learning machine learning], optimizing the [https://en.wikipedia.org/wiki/Gradient_descent gradient descent] during each search once a random weight vector is picked. The gradient descent is a strategy that searches through a large or infinite hypothesis space whenever 1) there are hypotheses continuously being parameterized and 2) the errors are differentiable based on the parameters. The problem with gradient descent is that [https://en.wikipedia.org/wiki/Convergence_(logic) converging] to a [https://en.wikipedia.org/wiki/Maxima_and_minima local minimum] takes extensive time and determining a global minimum is not guaranteed.<ref name=McGrawHill2003>Mitchell, T. M. (1997). Machine Learning (1st ed.). McGraw-Hill Education. Page 92. ISBN 0070428077.</ref> In SGD, the user initializes the weights and the process updates the weight vector using one data point<ref name="bishop" />. The gradient descent continuously updates it incrementally when an error calculation is completed to improve convergence.<ref name="Needell=">Needell, D., Srebro, N., & Ward, R. (2015, January). Stochastic gradient descent weighted sampling, and the randomized Kaczmarz algorithm. https://arxiv.org/pdf/1310.5715.pdf</ref> The method seeks to determine the steepest descent and it reduces the number of [https://en.wikipedia.org/wiki/Iteration iterations] and the time taken to search large quantities of data points. Over the recent years, the data sizes have increased immensely such that current processing capabilities are not enough.<ref name=Bottou1991>Bottou, L. (1991) Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91. https://leon.bottou.org/publications/pdf/nimes-1991.pdf</ref> Stochastic gradient descent is being used in [https://en.wikipedia.org/wiki/Neural_network neural networks] and decreases machine computation time while increasing complexity and performance for large-scale problems.<ref name=bottou2012>Bottou, L. (2012) Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, 421– 436. Springer.</ref>


== Theory ==
== Theory ==
[[File:Gradient Descent Visualization.png|alt=Visualization of the gradient descent algorithm|thumb|Visualization of the gradient descent algorithm<ref name=":0">Lau, S., Gonzalez, J., Nolan, D. (2020) <nowiki>https://www.textbook.ds100.org/ch/11/gradient_stochastic.html</nowiki></ref>]]
SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function <math>J(\theta)</math> by iteratively updating each parameter <math>\theta</math> by a small amount based on the negative gradient of a given data set. The steps for performing gradient descent are as follows:<blockquote>Step 1: Select a learning rate <math>\alpha</math>
SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function <math>J(\theta)</math> by iteratively updating each parameter <math>\theta</math> by a small amount based on the negative gradient of a given data set. The steps for performing gradient descent are as follows:<blockquote>Step 1: Select a learning rate <math>\alpha</math>


Line 13: Line 14:
Step 4: Repeat Step 3 until a local minima is reached</blockquote>
Step 4: Repeat Step 3 until a local minima is reached</blockquote>


Under batch gradient descent, the gradient, <math>{\nabla_\theta}J(\theta)</math>, is calculated at every step against a full [[wikipedia:Data_set|data set]]. When the training data is large, [[wikipedia:Computation|computation]] may be slow or require large amounts of [[wikipedia:Computer_memory#:~:text=In%20computing%2C%20memory%20refers%20to,or%20related%20computer%20hardware%20device.&text=Examples%20of%20non%2Dvolatile%20memory,storing%20firmware%20such%20as%20BIOS).|computer memory]].
Under batch gradient descent, the gradient, <math>{\nabla_\theta}J(\theta)</math>, is calculated at every step against a full [[wikipedia:Data_set|data set]]. When the training data is large, [[wikipedia:Computation|computation]] may be slow or require large amounts of [[wikipedia:Computer_memory#:~:text=In%20computing%2C%20memory%20refers%20to,or%20related%20computer%20hardware%20device.&text=Examples%20of%20non%2Dvolatile%20memory,storing%20firmware%20such%20as%20BIOS).|computer memory]].<ref name="bishop">Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.</ref>
[[File:Visualization of stochastic gradient descent.png|alt=Visualization of the stochastic gradient descent algorithm|thumb|Visualization of the stochastic gradient descent algorithm<ref name=":0" />]]
 
===== Stochastic Gradient Descent Algorithm =====
===== Stochastic Gradient Descent Algorithm =====
SGD modifies the batch gradient descent algorithm by calculating the gradient for only one data set at every iteration. The steps for performing SGD are as follows: <blockquote>Step 1: Randomly shuffle the data set of size m  
SGD modifies the batch gradient descent [https://en.wikipedia.org/wiki/Algorithm algorithm] by calculating the gradient for only one training example at every iteration.<ref name=ruder>Ruder, S. (2020, March 20). An overview of gradient descent optimization algorithms. Sebastian Ruder. https://ruder.io/optimizing-gradient-descent/index.html#batchgradientdescent</ref> The steps for performing SGD are as follows: <blockquote>Step 1: Randomly shuffle the data set of size m  


Step 2: Select a learning rate <math>\alpha</math>  
Step 2: Select a learning rate <math>\alpha</math>  
Line 21: Line 24:
Step 3: Select initial parameter values <math>\theta</math> as the starting point  
Step 3: Select initial parameter values <math>\theta</math> as the starting point  


Step 4: Update all parameters from the gradient of a single training example <math>x^i, y^i</math>, i.e. compute <math>\theta_{i+1}=\theta_i-\alpha\times{\nabla_\theta}J(\theta;x^i;y^i)</math>  
Step 4: Update all parameters from the gradient of a single training example <math>x^j, y^j</math>, i.e. compute <math>\theta_{i+1}=\theta_i-\alpha\times{\nabla_\theta}J(\theta;x^j;y^j)</math>  


Step 5: Repeat Step 4 until a local minima is reached </blockquote>By calculating the gradient for one data set per iteration, SGD takes a less direct route towards the local minima. However, SGD has the advantage of having the ability to incrementally update an objective function <math>J(\theta)</math> when new training data is available at minimum cost.
Step 5: Repeat Step 4 until a local minimum is reached </blockquote>By calculating the gradient for one data set per iteration, SGD takes a less direct route towards the local minimum. However, SGD has the advantage of having the ability to [https://en.wikipedia.org/wiki/Increment_and_decrement_operators incrementally] update an objective function <math>J(\theta)</math> when new training data is available at minimum cost.


===== Learning Rate =====
===== Learning Rate =====
The learning rate is used to calculate the step size at every iteration. Too large a learning rate and the step sizes may overstep too far past the optimum value. Too small a learning rate may require many iterations to reach a local minimum. A good starting point for the learning rate is 0.1 and adjust as necessary.
The [https://en.wikipedia.org/wiki/Learning_rate learning rate] is used to calculate the step size at every iteration. Too large a learning rate and the step sizes may overstep too far past the optimum value. Too small a learning rate may require many iterations to reach a [https://en.wikipedia.org/wiki/Maxima_and_minima local minimum]. A good starting point for the learning rate is 0.1 and adjust as necessary.<ref>Srinivasan, A. (2019, September) Stochastic Gradient Descent — Clearly Explained. https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31</ref>
 
===== Mini-Batch Gradient Descent =====
===== Mini-Batch Gradient Descent =====
A variation on stochastic gradient descent is the mini-batch gradient descent. In SGD, the gradient is computed on only one training example and may result in a large number of iterations required to converge on a local minima. Mini-batch gradient descent offers a compromise between batch gradient descent and SGD by splitting the training data into smaller batches. The steps for performing mini-batch gradient descent are identical to SGD with one exception - when updating the parameters from the gradient, rather than calculating the gradient of a single training example, the gradient is calculated against a batch size of <math>n</math> training examples, i.e. compute <math>\theta_{i+1}=\theta_i-\alpha\times{\nabla_\theta}J(\theta;x^{i:i+n};y^{i:i+n})</math>
A variation on stochastic gradient descent is the mini-batch gradient descent. In SGD, the gradient is computed on only one training example and may result in a large number of iterations required to converge on a local minimum. Mini-batch gradient descent offers a compromise between batch gradient descent and SGD by splitting the training data into smaller batches. The steps for performing mini-batch gradient descent are identical to SGD with one exception - when updating the parameters from the gradient, rather than calculating the gradient of a single training example, the gradient is calculated against a batch size of <math>n</math> training examples, i.e. compute <math>\theta_{i+1}=\theta_i-\alpha\times{\nabla_\theta}J(\theta;x^{j:j+n};y^{j:j+n})</math>


== Numerical Example ==
== Numerical Example ==
===== Data preparation =====
===== Data preparation =====
Consider a simple 2-D data set with only 6 data points (each point has <math>x_1, x_2</math>), and each data point have their label value <math>y</math> assigned to them.
Consider a simple 2-D data set with only 6 data points (each point has <math>x_1, x_2</math>), and each data point have a label value <math>y</math> assigned to them.
===== Model overview =====
===== Model overview =====
For the purpose of demonstrating the computation of the SGD process, we simply employ a linear regression model:  <math>y = w_1\ x_1 + w_2\ x_2 + b </math>, where <math>w_1</math> and <math>w_2</math> are weights and <math>b</math> is the bias term.  In this case, the goal of this model is to find the best value for <math>w_1, w_2</math> and <math>b</math>, based on the datasets.
For the purpose of demonstrating the computation of the SGD process, simply employ a linear regression model:  <math>y = w_1\ x_1 + w_2\ x_2 + b </math>, where <math>w_1</math> and <math>w_2</math> are weights and <math>b</math> is the constant term.  In this case, the goal of this model is to find the best value for <math>w_1, w_2</math> and <math>b</math>, based on the datasets.
===== Definition of loss function =====
===== Definition of loss function =====
In this example, the loss function should be l2 norm square, that is  <math>L = (\widehat{y} - y)^2 </math>.
In this example, the loss function should be l2 norm square, that is  <math>L = (\widehat{y} - y)^2 </math>.
===== Forward =====
===== Forward =====
<blockquote>'''Initial Weights:'''
<blockquote>'''Initial Weights:'''
The linear regression model starts by initializing the weights <math>w_1, w_2</math> and setting the bias term at 0.  In this case, we initiate [<math>w_1, w_2</math>] = [-0.017, -0.048].
The linear regression model starts by [https://en.wikipedia.org/wiki/Initialization_(programming) initializing] the weights <math>w_1, w_2</math> and setting the bias term at 0.  In this case, initiate [<math>w_1, w_2</math>] = [-0.044, -0.042].


'''Dataset:'''
'''Dataset:'''


After the initialization of weights, the model then tries to confirm the batch size and data input on every epoch during training.  In this problem, we have 1 for batch size and the entire dataset is:
For this problem, the batch size is set to 1 and the entire dataset of [ <math>x_1</math>, <math>x_2</math>, <math>y</math>] is given by:
{|
{| class="wikitable"
|+
! <math>x_1</math> !! <math>x_2</math> !! <math>y</math>
!
!<math>x_1</math>
!<math>x_2</math>
!<math>y</math>
|-
|-
|1)
| 4 || 1 || 2
|4
|1
|2
|-
|-
|2)
| 2 || 8 || -14
|2
|8
| -14
|-
|-
|3)
| 1 || 0 || 1
|1
|0
|1
|-
|-
|4)
| 3 || 2 || -1
|3
|2
| -1
|-
|-
|5)
| 1 || 4 || -7
|1
|4
| -7
|-
|-
|6)
| 6 || 7 || -8
|6
|7
| -8
|}
|}
</blockquote>


===== Backpropagation (BP) =====
===== Gradient Computation and Parameter Update =====
The purpose of bp is to obtain the impact of the weights and bias terms for the entire model.  The update of the model is entirely dependent on the gradient values.  To minimize the loss during the process, the model needs to ensure the gradient is dissenting so that it could finally converge to a global optimal point.  All the 3 partial differential equations are shown as:
The purpose of BP is to obtain the impact of the weights and bias terms for the entire model.  The update of the model is entirely dependent on the gradient values. To minimize the loss during the process, the model needs to ensure the gradient is dissenting so that it could finally converge to a global optimal point.  All the 3 partial differential equations are shown as:


<math>\omega_1^' = \omega_1 - \eta\ {\partial L\over\partial \omega_1} = \omega_1 - \eta\ {\partial L\over\partial \widehat{y}}\cdot {\partial \widehat{y}\over\partial \omega_1} = \omega_1 - \eta\ [2(\widehat{y} - y)\cdot x_1] </math>
<math>\omega_1^' = \omega_1 - \eta\ {\partial L\over\partial \omega_1} = \omega_1 - \eta\ {\partial L\over\partial \widehat{y}}\cdot {\partial \widehat{y}\over\partial \omega_1} = \omega_1 - \eta\ [2(\widehat{y} - y)\cdot x_1] </math>
Line 93: Line 72:
<math>b^' = b - \eta\ {\partial L\over\partial b} = b - \eta\ {\partial L\over\partial \widehat{y}}\cdot {\partial \widehat{y}\over\partial b} = b - \eta\ [2(\widehat{y} - y)\cdot 1]</math>
<math>b^' = b - \eta\ {\partial L\over\partial b} = b - \eta\ {\partial L\over\partial \widehat{y}}\cdot {\partial \widehat{y}\over\partial b} = b - \eta\ [2(\widehat{y} - y)\cdot 1]</math>


,where the <math>\eta</math> stands for the learning rate, in this model, it is set to be 0.05.  To update each parameter, one just simply substitudes the value of resulting <math>\widehat{y}</math> in.
Where the <math>\eta</math> stands for the learning rate and in this model, is set to be 0.05.  To update each parameter, simply substitute the value of resulting <math>\widehat{y}</math>.
 
Use the first data point [<math>x_1, x_2</math>] = [4, 1] and the corresponding <math>y</math> being 2.  The <math>\widehat{y}</math> the model gave should be -0.2.  Now with <math>\widehat{y}</math> and <math>y</math> value, update the new parameters as [0.843, 0.179, 0.222] = [<math>w'_1, w'_2, b'</math>].  That marks the end of iteration 1.


Let's just use our first data point [<math>x_1, x_2</math>] = 4, 1 and its corresponding <math>y</math> being 2.  So the <math>\widehat{y}</math> our model gave should be -0.116.  Now with <math>\widehat{y}</math> and <math>y</math> value, we update the new parameters as [0.829, 0.164, 0.212] = [<math>w'_1, w'_2, b'</math>].  And that marks the end of iteration 1.
Now, iteration 2 begins, with the next data point [2, 8] and the label -14. The estimation , <math>\widehat{y}</math> is now 3.3. With the new <math>\widehat{y}</math> and <math>y</math> value, once again, we update the weight as [-2.625, -13.696, 1.513]. And that marks the end of iteration 2.


Keep on updating the model, we will finally have [<math>w_1, w_2, b</math>] = [0.43, -0.21, 0.77].
Keep on updating the model through additional iterations to output [<math>w_1, w_2, b</math>] = [-19.021, -35.812, -1.232].


This is just a simple demonstration of the SGD process.  In actual practice, we could always have more epochs to run through the entire dataset enough times to ensure the best learning results based on the training dataset.  But learning overly specific with the training dataset could sometimes also expose the model to the risk of overfitting.  Therefore tuning such parameters is quite tricky and often costs us days or even weeks before we find the best results.
This is just a simple demonstration of the SGD process.  In actual practice, more epochs can be utilized to run through the entire dataset enough times to ensure the best learning results based on the training dataset<ref name=":1">Lawrence, S., & Giles, C. L. (2000). Overfitting and neural networks: conjugate gradient and backpropagation. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, 1, 114–119. https://doi.org/10.1109/ijcnn.2000.857823</ref>.  But learning overly specific with the training dataset could sometimes also expose the model to the risk of overfitting<ref name=":1" />.  Therefore, tuning such parameters is quite tricky and often costs days or even weeks before finding the best results.


==Application==
==Application==
SGD, often referred to as the cornerstone for deep learning, is an algorithm for training a wide range of models in machine learning. Due to SGD’s efficiency in dealing with large scale datasets, it is the most common method for training deep neural networks. Furthermore, SGD has received considerable attention and is applied to text classification and natural language processing. It is best suited for unconstrained optimization problems and is the main way to train large linear models on very large data sets. Implementation of stochastic gradient descent include areas in ridge regression and regularized logistic regression. Other problems, such as Lasso (Shalev-Shwartz and Tewari, 2011) and support vector machines (Menon, 2009) can be solved by stochastic gradient descent.
SGD, often referred to as the cornerstone for deep learning, is an algorithm for training a wide range of models in machine learning. [[wikipedia:Deep_learning|Deep learning]] is a machine learning technique that teaches computers to do what comes naturally to humans. In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Models are trained by using a large set of labeled data and neural network architectures that contain many layers. Neural networks make up the backbone of deep learning algorithms. A neural network that consists of more than three layers which would be inclusive of the inputs and the output can be considered a deep learning algorithm. Due to SGD’s efficiency in dealing with large scale datasets, it is the most common method for training [https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks deep neural networks]. Furthermore, SGD has received considerable attention and is applied to text classification and [https://en.wikipedia.org/wiki/Natural_language_processing natural language processing]. It is best suited for unconstrained optimization problems and is the main way to train large linear models on very large data sets. Implementation of stochastic gradient descent include areas in [https://en.wikipedia.org/wiki/Tikhonov_regularization ridge regression] and regularized [https://en.wikipedia.org/wiki/Logistic_regression logistic regression]. Other problems, such as Lasso<ref name="Shwartz">Shalev-Shwartz, S. and Tewari, A. (2011) Stochastic methods for ℓ<math>_1</math>-regularized loss minimization. The Journal of Machine Learning Research, 12, 1865–1892. https://www.jmlr.org/papers/volume12/shalev-shwartz11a/shalev-shwartz11a.pdf</ref> and support vector machines<ref name=Menon>Menon, A. (2009, February). Large-Scale Support Vector Machines: Algorithms and Theory. http://cseweb.ucsd.edu/~akmenon/ResearchExamTalk.pdf</ref> can be solved by stochastic gradient descent.


===Support Vector Machine===
===Support Vector Machine===
SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex functions such as (linear) Support Vector Machines (SVM). A support vector machine is a supervised machine learning model that uses classification algorithms for two-group classification problems. An SVM finds what is known as a separating hyperplane: a hyperplane (a line, in the two-dimensional case) which separates the two classes of points from one another. It is a fast and dependable classification algorithm that performs very well with a limited amount of data to analyze. However, because SVM is computationally costly, software applications often do not provide sufficient performance in order to meet time requirements for large amounts of data. To improve SVM scalability regarding the size of the data set, SGD algorithms are used as a simplified procedure for evaluating the gradient of a function.
SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex functions such as (linear) [https://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines] (SVM). A support vector machine is a supervised machine learning model that uses classification algorithms for two-group classification problems. An SVM finds what is known as a separating hyperplane: a hyperplane (a line, in the two-dimensional case) which separates the two classes of points from one another. It is a fast and dependable classification algorithm that performs very well with a limited amount of data to analyze. However, because SVM is computationally costly, software applications often do not provide sufficient performance in order to meet time requirements for large amounts of data. To improve SVM scalability regarding the size of the data set, SGD algorithms are used as a simplified procedure for evaluating the gradient of a function.<ref name=lopes>Lopes, F.F.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation on FPGA of Support Vector Machines Using Stochastic Gradient Descent. Electronics 2019, 8, 631.</ref>


===Logistic regression===
===Logistic regression===
Logistic regression models the probabilities for classification problems with two possible outcomes. It's an extension of the linear regression model for classification problems. It is a statistical technique with the input variables as continuous variables and the output variable as a binary variable. It is a class of regression where the independent variable is used to predict the dependent variable.
Logistic regression models the [https://en.wikipedia.org/wiki/Probability probabilities] for classification problems with two possible outcomes. It's an extension of the linear regression model for classification problems. It is a statistical technique with the input variables as continuous variables and the output variable as a binary variable. It is a class of regression where the independent variable is used to predict the dependent variable. The objective of training a machine learning model is to minimize the loss or error between ground truths and predictions by changing the trainable parameters. Logistic regression has two phases: training, and testing. The system, specifically the weights w and b, is trained using stochastic gradient descent and the cross-entropy loss.  


===Full Waveform Inversion (FWI)===
===Full Waveform Inversion (FWI)===
The Full Waveform Inversion (FWI) is a seismic imaging process by drawing information from the physical parameters of samples. Companies use the process to produce high-resolution high velocity depictions of subsurface activities. SDG supports the process because it can identify the minima and the overall global minimum in less time as there are many local minimums (Witte).  
The Full Waveform Inversion (FWI) is a [https://en.wikipedia.org/wiki/Geophysical_imaging seismic imaging] process by drawing information from the physical parameters of samples. Companies use the process to produce high-resolution high velocity depictions of subsurface activities. SGD supports the process because it can identify the minima and the overall global minimum in less time as there are many local minimums.<ref name=witte>Witte, P., Louboutin, M., Lensink, K., Lange, M., Kukreja, N., Luporini, F., Gorman, G., Herrmann, F.J.; Full-waveform inversion, Part 3: Optimization. The Leading Edge ; 37 (2): 142–145. doi: https://doi.org/10.1190/tle37020142.1</ref>


==Conclusion==
==Conclusion==
Stochastic Gradient Descent is an algorithm that seeks to find the steepest descent during each iteration.  The process decreases the time it takes to search large data sets and determine local minima immensely. The process helps determine the global minimum. The SDG provides many applications in machine learning, geophysics, least mean squares (LMS), and other areas.
SGD is an algorithm that seeks to find the steepest descent during each iteration.  The process decreases the time it takes to search large data sets and determine local minima immensely. The SGD provides many applications in machine learning, geophysics, least mean squares (LMS), and other areas.


==References==
==References==
#Bordes, A., Bottou, L., & Gallinari, P. (2009, July). SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research 10:1737-1754. https://jmlr.org/papers/volume10/bordes09a/bordes09a.pdf
 
#Bottou, L. (1991) Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91. https://leon.bottou.org/publications/pdf/nimes-1991.pdf
<references />
#Bottou, L. (2012) Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, 421– 436. Springer.
#Witte, P., Louboutin, M., Lensink, K., Lange, M., Kukreja, N., Luporini, F., Gorman, G., Herrmann, F.J.; Full-waveform inversion, Part 3: Optimization. The Leading Edge ; 37 (2): 142–145. doi: https://doi.org/10.1190/tle37020142.1
#Shalev-Shwartz, S. and Tewari, A. (2011) Stochastic methods for `<math>ℓ_1</math>-regularized loss minimization. The Journal of Machine Learning Research, 12, 1865–1892. https://www.jmlr.org/papers/volume12/shalev-shwartz11a/shalev-shwartz11a.pdf
#Lopes, F.F.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation on FPGA of Support Vector Machines Using Stochastic Gradient Descent. Electronics 2019, 8, 631.
#IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. III (July – Aug. 2015), PP 39-47 http://www.iosrjournals.org
#Ruder, S. (2020, March 20). An overview of gradient descent optimization algorithms. Sebastian Ruder. https://ruder.io/optimizing-gradient-descent/index.html#batchgradientdescent

Latest revision as of 07:41, 21 December 2020

Authors: Jonathon Price, Alfred Wong, Tiancheng Yuan, Joshua Mathews, Taiwo Olorunniwo (SysEn 5800 Fall 2020)

Introduction

Stochastic gradient descent (abbreviated as SGD) is an iterative method often used for machine learning, optimizing the gradient descent during each search once a random weight vector is picked. The gradient descent is a strategy that searches through a large or infinite hypothesis space whenever 1) there are hypotheses continuously being parameterized and 2) the errors are differentiable based on the parameters. The problem with gradient descent is that converging to a local minimum takes extensive time and determining a global minimum is not guaranteed.[1] In SGD, the user initializes the weights and the process updates the weight vector using one data point[2]. The gradient descent continuously updates it incrementally when an error calculation is completed to improve convergence.[3] The method seeks to determine the steepest descent and it reduces the number of iterations and the time taken to search large quantities of data points. Over the recent years, the data sizes have increased immensely such that current processing capabilities are not enough.[4] Stochastic gradient descent is being used in neural networks and decreases machine computation time while increasing complexity and performance for large-scale problems.[5]

Theory

Visualization of the gradient descent algorithm
Visualization of the gradient descent algorithm[6]

SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function by iteratively updating each parameter by a small amount based on the negative gradient of a given data set. The steps for performing gradient descent are as follows:

Step 1: Select a learning rate

Step 2: Select initial parameter values as the starting point

Step 3: Update all parameters from the gradient of the training data set, i.e. compute

Step 4: Repeat Step 3 until a local minima is reached

Under batch gradient descent, the gradient, , is calculated at every step against a full data set. When the training data is large, computation may be slow or require large amounts of computer memory.[2]

Visualization of the stochastic gradient descent algorithm
Visualization of the stochastic gradient descent algorithm[6]
Stochastic Gradient Descent Algorithm

SGD modifies the batch gradient descent algorithm by calculating the gradient for only one training example at every iteration.[7] The steps for performing SGD are as follows:

Step 1: Randomly shuffle the data set of size m

Step 2: Select a learning rate

Step 3: Select initial parameter values as the starting point

Step 4: Update all parameters from the gradient of a single training example , i.e. compute

Step 5: Repeat Step 4 until a local minimum is reached

By calculating the gradient for one data set per iteration, SGD takes a less direct route towards the local minimum. However, SGD has the advantage of having the ability to incrementally update an objective function when new training data is available at minimum cost.

Learning Rate

The learning rate is used to calculate the step size at every iteration. Too large a learning rate and the step sizes may overstep too far past the optimum value. Too small a learning rate may require many iterations to reach a local minimum. A good starting point for the learning rate is 0.1 and adjust as necessary.[8]

Mini-Batch Gradient Descent

A variation on stochastic gradient descent is the mini-batch gradient descent. In SGD, the gradient is computed on only one training example and may result in a large number of iterations required to converge on a local minimum. Mini-batch gradient descent offers a compromise between batch gradient descent and SGD by splitting the training data into smaller batches. The steps for performing mini-batch gradient descent are identical to SGD with one exception - when updating the parameters from the gradient, rather than calculating the gradient of a single training example, the gradient is calculated against a batch size of training examples, i.e. compute

Numerical Example

Data preparation

Consider a simple 2-D data set with only 6 data points (each point has ), and each data point have a label value assigned to them.

Model overview

For the purpose of demonstrating the computation of the SGD process, simply employ a linear regression model: , where and are weights and is the constant term. In this case, the goal of this model is to find the best value for and , based on the datasets.

Definition of loss function

In this example, the loss function should be l2 norm square, that is .

Forward

Initial Weights:

The linear regression model starts by initializing the weights and setting the bias term at 0. In this case, initiate [] = [-0.044, -0.042].

Dataset:

For this problem, the batch size is set to 1 and the entire dataset of [ , , ] is given by:

4 1 2
2 8 -14
1 0 1
3 2 -1
1 4 -7
6 7 -8
Gradient Computation and Parameter Update

The purpose of BP is to obtain the impact of the weights and bias terms for the entire model.  The update of the model is entirely dependent on the gradient values. To minimize the loss during the process, the model needs to ensure the gradient is dissenting so that it could finally converge to a global optimal point.  All the 3 partial differential equations are shown as:

Where the stands for the learning rate and in this model, is set to be 0.05. To update each parameter, simply substitute the value of resulting .

Use the first data point [] = [4, 1] and the corresponding being 2.  The the model gave should be -0.2.  Now with and value, update the new parameters as [0.843, 0.179, 0.222] = [].  That marks the end of iteration 1.

Now, iteration 2 begins, with the next data point [2, 8] and the label -14. The estimation , is now 3.3. With the new and value, once again, we update the weight as [-2.625, -13.696, 1.513]. And that marks the end of iteration 2.

Keep on updating the model through additional iterations to output [] = [-19.021, -35.812, -1.232].

This is just a simple demonstration of the SGD process.  In actual practice, more epochs can be utilized to run through the entire dataset enough times to ensure the best learning results based on the training dataset[9].  But learning overly specific with the training dataset could sometimes also expose the model to the risk of overfitting[9].  Therefore, tuning such parameters is quite tricky and often costs days or even weeks before finding the best results.

Application

SGD, often referred to as the cornerstone for deep learning, is an algorithm for training a wide range of models in machine learning. Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans. In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Models are trained by using a large set of labeled data and neural network architectures that contain many layers. Neural networks make up the backbone of deep learning algorithms. A neural network that consists of more than three layers which would be inclusive of the inputs and the output can be considered a deep learning algorithm. Due to SGD’s efficiency in dealing with large scale datasets, it is the most common method for training deep neural networks. Furthermore, SGD has received considerable attention and is applied to text classification and natural language processing. It is best suited for unconstrained optimization problems and is the main way to train large linear models on very large data sets. Implementation of stochastic gradient descent include areas in ridge regression and regularized logistic regression. Other problems, such as Lasso[10] and support vector machines[11] can be solved by stochastic gradient descent.

Support Vector Machine

SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex functions such as (linear) Support Vector Machines (SVM). A support vector machine is a supervised machine learning model that uses classification algorithms for two-group classification problems. An SVM finds what is known as a separating hyperplane: a hyperplane (a line, in the two-dimensional case) which separates the two classes of points from one another. It is a fast and dependable classification algorithm that performs very well with a limited amount of data to analyze. However, because SVM is computationally costly, software applications often do not provide sufficient performance in order to meet time requirements for large amounts of data. To improve SVM scalability regarding the size of the data set, SGD algorithms are used as a simplified procedure for evaluating the gradient of a function.[12]

Logistic regression

Logistic regression models the probabilities for classification problems with two possible outcomes. It's an extension of the linear regression model for classification problems. It is a statistical technique with the input variables as continuous variables and the output variable as a binary variable. It is a class of regression where the independent variable is used to predict the dependent variable. The objective of training a machine learning model is to minimize the loss or error between ground truths and predictions by changing the trainable parameters. Logistic regression has two phases: training, and testing. The system, specifically the weights w and b, is trained using stochastic gradient descent and the cross-entropy loss.

Full Waveform Inversion (FWI)

The Full Waveform Inversion (FWI) is a seismic imaging process by drawing information from the physical parameters of samples. Companies use the process to produce high-resolution high velocity depictions of subsurface activities. SGD supports the process because it can identify the minima and the overall global minimum in less time as there are many local minimums.[13]

Conclusion

SGD is an algorithm that seeks to find the steepest descent during each iteration. The process decreases the time it takes to search large data sets and determine local minima immensely. The SGD provides many applications in machine learning, geophysics, least mean squares (LMS), and other areas.

References

  1. Mitchell, T. M. (1997). Machine Learning (1st ed.). McGraw-Hill Education. Page 92. ISBN 0070428077.
  2. 2.0 2.1 Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.
  3. Needell, D., Srebro, N., & Ward, R. (2015, January). Stochastic gradient descent weighted sampling, and the randomized Kaczmarz algorithm. https://arxiv.org/pdf/1310.5715.pdf
  4. Bottou, L. (1991) Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91. https://leon.bottou.org/publications/pdf/nimes-1991.pdf
  5. Bottou, L. (2012) Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, 421– 436. Springer.
  6. 6.0 6.1 Lau, S., Gonzalez, J., Nolan, D. (2020) https://www.textbook.ds100.org/ch/11/gradient_stochastic.html
  7. Ruder, S. (2020, March 20). An overview of gradient descent optimization algorithms. Sebastian Ruder. https://ruder.io/optimizing-gradient-descent/index.html#batchgradientdescent
  8. Srinivasan, A. (2019, September) Stochastic Gradient Descent — Clearly Explained. https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31
  9. 9.0 9.1 Lawrence, S., & Giles, C. L. (2000). Overfitting and neural networks: conjugate gradient and backpropagation. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, 1, 114–119. https://doi.org/10.1109/ijcnn.2000.857823
  10. Shalev-Shwartz, S. and Tewari, A. (2011) Stochastic methods for ℓ-regularized loss minimization. The Journal of Machine Learning Research, 12, 1865–1892. https://www.jmlr.org/papers/volume12/shalev-shwartz11a/shalev-shwartz11a.pdf
  11. Menon, A. (2009, February). Large-Scale Support Vector Machines: Algorithms and Theory. http://cseweb.ucsd.edu/~akmenon/ResearchExamTalk.pdf
  12. Lopes, F.F.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation on FPGA of Support Vector Machines Using Stochastic Gradient Descent. Electronics 2019, 8, 631.
  13. Witte, P., Louboutin, M., Lensink, K., Lange, M., Kukreja, N., Luporini, F., Gorman, G., Herrmann, F.J.; Full-waveform inversion, Part 3: Optimization. The Leading Edge ; 37 (2): 142–145. doi: https://doi.org/10.1190/tle37020142.1