Stochastic gradient descent: Difference between revisions

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Revision as of 22:47, 19 November 2020

Stochastic gradient descent (abbreviated as SGD) is an iterative method often used for machine learning; starting from a randomly picked weight vector, it approximates the gradient descent update at each step. Gradient descent is a strategy for searching through a large or infinite hypothesis space that can be applied whenever 1) the hypotheses are continuously parameterized and 2) the errors are differentiable with respect to the parameters. The problem with gradient descent is that converging to a local minimum can take extensive time, and finding a global minimum is not guaranteed (McGrawHill, 92). Gradient descent picks a random weight vector and updates it incrementally after each error calculation to improve convergence (Needell, Ward, Srebro, 14). The method follows the direction of steepest descent and reduces the number of iterations and the time taken to search large quantities of data points. Over recent years, data sizes have increased immensely, such that current processing capabilities are not enough (Bottou, 177). Stochastic gradient descent is used in neural networks and decreases machine computation time while scaling to complex, large-scale problems.

Authors: Jonathon Price, Alfred Wong, Tiancheng Yuan, Joshua Matthews, Taiwo Olorunniwo (SYSEN 5800 Fall 2020)
Steward: Fengqi You

Theory

SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function <math>J(\theta)</math> by iteratively updating each parameter <math>\theta</math> by a small amount based on the negative gradient of a given data set. The steps for performing gradient descent are as follows:

Step 1: Select a learning rate <math>\alpha</math>

Step 2: Select initial parameter values <math>\theta</math> as the starting point

Step 3: Update all parameters from the gradient of the training data set, i.e. compute <math>\theta_{i+1}=\theta_i-\alpha\times{\nabla_\theta}J(\theta)</math>

Step 4: Repeat Step 3 until a local minimum is reached

Under batch gradient descent, the gradient, <math>{\nabla_\theta}J(\theta)</math>, is calculated at every step against the full data set. When the training data is large, computation may be slow or require large amounts of computer memory.
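As an illustrative sketch (the least-squares model, toy data, and learning rate below are assumptions, not from the article), the batch update in Step 3 can be written as:

```python
# Batch gradient descent for a least-squares fit of y = theta * x.
# Objective: J(theta) = (1/2m) * sum_i (theta*x_i - y_i)^2
# Gradient:  dJ/dtheta = (1/m) * sum_i (theta*x_i - y_i) * x_i

def batch_gradient_descent(xs, ys, alpha=0.1, iterations=100):
    theta = 0.0  # Step 2: initial parameter value
    m = len(xs)
    for _ in range(iterations):
        # Step 3: gradient computed over the FULL data set at every step
        grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
        theta = theta - alpha * grad
    return theta

# Toy data generated from y = 2x, so theta should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta = batch_gradient_descent(xs, ys)
```

Note that every iteration touches all m examples; this is the cost that motivates the stochastic variant below.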

Stochastic Gradient Descent Algorithm

SGD modifies the batch gradient descent algorithm by calculating the gradient against only a single training example at every iteration. The steps for performing SGD are as follows:

Step 1: Randomly shuffle the data set of size m

Step 2: Select a learning rate <math>\alpha</math>

Step 3: Select initial parameter values <math>\theta</math> as the starting point

Step 4: Update all parameters from the gradient of a single training example <math>x^i, y^i</math>, i.e. compute <math>\theta_{i+1}=\theta_i-\alpha\times{\nabla_\theta}J(\theta;x^i;y^i)</math>

Step 5: Repeat Step 4 until a local minimum is reached

By calculating the gradient from one training example per iteration, SGD takes a less direct route towards the local minimum.
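The steps above can be sketched in Python (the least-squares model, toy data, and hyperparameters are illustrative assumptions, not from the article):

```python
import random

# Stochastic gradient descent for a least-squares model y = theta * x.
# Per-example objective: J(theta; x, y) = (1/2) * (theta*x - y)^2
# Per-example gradient:  (theta*x - y) * x

def sgd(xs, ys, alpha=0.05, epochs=50, seed=0):
    rng = random.Random(seed)
    theta = 0.0                      # Step 3: initial parameter value
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)            # Step 1: randomly shuffle the data set
        for x, y in data:            # Step 4: update from ONE example at a time
            grad = (theta * x - y) * x
            theta = theta - alpha * grad
    return theta

# Toy data generated from y = 2x, so theta should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta = sgd(xs, ys)
```

Each update costs one example rather than the whole data set, at the price of a noisier path toward the minimum.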


Learning Rate

The learning rate is used to calculate the step size at every iteration. If the learning rate is too large, the steps may overshoot past the optimum value; if it is too small, many iterations may be required to reach a local minimum. A good starting point for the learning rate is 0.1.
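The effect can be sketched on the toy objective J(θ) = θ² (an illustrative example, not from the article), whose gradient is 2θ and whose minimum is at θ = 0:

```python
# Gradient descent on J(theta) = theta^2 with different learning rates.

def descend(alpha, steps=25, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # gradient of theta^2 is 2*theta
    return theta

small = descend(alpha=0.01)  # converges, but slowly
good  = descend(alpha=0.1)   # converges much faster
large = descend(alpha=1.1)   # overshoots: |theta| grows at every step
```

With alpha = 1.1 each step multiplies the error by -1.2, so the iterates diverge instead of converging.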

Mini-Batch Gradient Descent

A variation on stochastic gradient descent is mini-batch gradient descent. In SGD, the gradient is computed on only one training example, which may result in a large number of iterations being required to converge on a local minimum. Mini-batch gradient descent offers a compromise between batch gradient descent and SGD by splitting the training data into smaller batches and computing the gradient against one batch at each step.
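A minimal mini-batch sketch on an illustrative least-squares model y = θx (toy data, batch size, and hyperparameters are assumptions):

```python
import random

# Mini-batch gradient descent: average the per-example gradients over a small
# batch instead of one example (SGD) or the full data set (batch GD).

def minibatch_gd(xs, ys, batch_size=2, alpha=0.05, epochs=50, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # average gradient over the current mini-batch
            grad = sum((theta * x - y) * x for x, y in batch) / len(batch)
            theta = theta - alpha * grad
    return theta

# Toy data generated from y = 2x, so theta should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta = minibatch_gd(xs, ys)
```

Setting batch_size to 1 recovers SGD, and setting it to the full data-set size recovers batch gradient descent.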

Numerical Example

[Insert text]
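As a small worked illustration (the model, data point, and learning rate are assumed, not from the article): for the per-example objective J(θ; x, y) = ½(θx − y)² with training example (x, y) = (2, 4), initial value θ₀ = 0, and learning rate α = 0.1, the gradient is (θx − y)x = (0·2 − 4)·2 = −8, so one SGD update gives θ₁ = 0 − 0.1·(−8) = 0.8:

```python
# One hand-checkable SGD update for J(theta; x, y) = 0.5 * (theta*x - y)**2.
# These numbers are illustrative, not taken from the article.
x, y = 2.0, 4.0        # single training example
theta, alpha = 0.0, 0.1

grad = (theta * x - y) * x      # (0*2 - 4)*2 = -8.0
theta = theta - alpha * grad    # 0 - 0.1*(-8) = 0.8
```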

Application

SGD, often referred to as a cornerstone of deep learning, is an algorithm for training a wide range of models in machine learning. Due to its efficiency in dealing with large-scale datasets, it is the most common method for training deep neural networks. Furthermore, SGD has received considerable attention and is applied to text classification and natural language processing. It is best suited for unconstrained optimization problems and is the main way to train large linear models on very large data sets. Implementations of stochastic gradient descent include ridge regression and regularized logistic regression. Other problems, such as Lasso (Shalev-Shwartz and Tewari, 2011) and support vector machines (Menon, 2009), can also be solved by stochastic gradient descent.

Support Vector Machine

SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines (SVM). A support vector machine is a supervised machine learning model that uses classification algorithms for two-group classification problems. An SVM finds what is known as a separating hyperplane: a hyperplane (a line, in the two-dimensional case) which separates the two classes of points from one another. It is a fast and dependable classification algorithm that performs very well with a limited amount of data to analyze. However, because SVM is computationally costly, software applications often do not provide sufficient performance to meet time requirements for large amounts of data. To improve SVM scalability with respect to the size of the data set, SGD algorithms are used as a simplified procedure for evaluating the gradient of a function.
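As a hedged sketch of the idea (a hand-rolled SGD on the hinge loss with toy two-dimensional data; this is an illustration, not the implementation used in the cited references, and it omits regularization):

```python
import random

# SGD on the (unregularized) hinge loss of a linear classifier
# f(x) = w . x + b, with labels y in {-1, +1}.
# Per-example subgradient step: if y*(w.x + b) < 1, move w along y*x (and b along y).

def svm_sgd(points, labels, alpha=0.1, epochs=100, seed=0):
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    data = list(zip(points, labels))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:           # example violates the margin
                w[0] += alpha * y * x[0]
                w[1] += alpha * y * x[1]
                b += alpha * y
    return w, b

# Linearly separable toy data: class +1 on the right, class -1 on the left.
points = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-3.0, 1.0)]
labels = [1, 1, -1, -1]
w, b = svm_sgd(points, labels)
predictions = [1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1 for p in points]
```

Each update touches only one example, which is what makes this approach scale to data sets where evaluating the full gradient would be too costly.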

Logistic regression (Section in work)

Logistic regression models the probabilities for classification problems with two possible outcomes. It is an extension of the linear regression model to classification: the input variables are continuous, the output variable is binary, and the independent variables are used to predict the dependent variable.
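A minimal sketch of fitting such a model with SGD (toy one-dimensional data; the model, data, and hyperparameters are illustrative assumptions):

```python
import math
import random

# SGD for logistic regression: p(y=1 | x) = sigmoid(w*x + b) on 1-D inputs.
# Per-example gradient of the log loss: (p - y) * x for w, and (p - y) for b.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_sgd(xs, ys, alpha=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(w * x + b)   # predicted probability of class 1
            w -= alpha * (p - y) * x
            b -= alpha * (p - y)
    return w, b

# Binary outcome that flips as x crosses 0 (illustrative data).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = logistic_sgd(xs, ys)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
```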

Full Waveform Inversion (FWI)

Full Waveform Inversion (FWI) is a seismic imaging process that draws information from the physical parameters of samples. Companies use the process to produce high-resolution velocity depictions of subsurface activity. SGD supports the process because it can identify the minima, and approach the overall global minimum, in less time when there are many local minima (Witte).

Conclusion

Stochastic gradient descent is an algorithm that follows the direction of steepest descent at each iteration. The method greatly decreases the time it takes to search large data sets and find local minima, and it can help in approaching the global minimum. SGD has many applications in machine learning, geophysics, least mean squares (LMS), and other areas.

References

  1. Bordes, A., Bottou, L., and Gallinari, P. (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737-1754.
  2. Bottou, L. (1991) Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91.
  3. Bottou, L. (2012) Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, 421-436. Springer.
  4. Witte, P., Louboutin, M., Lensink, K., Lange, M., Kukreja, N., Luporini, F., Gorman, G., and Herrmann, F.J. Full-waveform inversion, Part 3: Optimization. The Leading Edge; 37 (2): 142-145. doi: https://doi.org/10.1190/tle37020142.1
  5. Shalev-Shwartz, S. and Tewari, A. (2011) Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research, 12, 1865-1892.
  6. Lopes, F.F., Ferreira, J.C., and Fernandes, M.A.C. Parallel Implementation on FPGA of Support Vector Machines Using Stochastic Gradient Descent. Electronics 2019, 8, 631.
  7. IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. III (July-Aug. 2015), PP 39-47. http://www.iosrjournals.org
  8. Ruder, S. An overview of gradient descent optimization algorithms. https://ruder.io/optimizing-gradient-descent/index.html#batchgradientdescent