Subgradient optimization: Difference between revisions

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
 
(21 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Subgradient_optimization
Author: Malichi Merski (mm2835), Ryan Ortiz (rjo64), Nicholas Phillips (ntp28) (ChemE 6800 Fall 2024)


Author Name: Aaron Anderson (ChE 345 Spring 2015) <br/>
Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
Steward: Dajun Yue and Fengqi You


[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''<sub>0</sub>.]]
[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''<sub>0</sub>.]]
'''Subgradient Optimization''' (or '''Subgradient Method''') is an iterative algorithm for minimizing convex functions, used predominantly in Nondifferentiable optimization for functions that are convex but nondifferentiable. It is often slower than Newton's Method when applied to convex differentiable functions, but can be used on convex nondifferentiable functions where Newton's Method will not converge. It was first developed by Naum Z. Shor in the Soviet Union in the 1960's.  
The '''subgradient method''' is a simple algorithm for the optimization of non-differentiable functions, and it originated in the Soviet Union during the 1960s and 70s, primarily by the contributions of Naum Z. Shor (Sharma, Shashi). While the calculations for this approach are similar to that of the gradient method for differentiable functions, there are several key differences. First, as noted the subgradient method applies strictly to non-differentiable functions as it reduces to the gradient method when <math>f</math> is differentiable. Secondly the step size is fixed before the application of the algorithm rather than being determined “on-line” as in other approaches. Finally the subgradient method is not a descent method as the value of f can and often will increase.


==Introduction==
==Introduction==
The '''Subgradient''' (related to Subderivative and Subdifferential) of a function is a way of generalizing or approximating the derivative of a convex function at nondifferentiable points. The definition of a subgradient is as follows: <math>g</math> is a subgradient of <math>f</math> at <math>x</math> if, for all <math>y</math>, the following is true: <br/>
The subgradient method is more computationally expensive when compared to Newton's method but applicable to a wider range of problems. Additionally, due to the method’s schema when applied numerically the memory requirements are smaller than other methods allowing larger problems to be approached. Further the combination of the subgradient method with the primal dual decomposition can simplify some applications to a distributed algorithm.
[[File:Subgradient.png|200px|center]]
 
An example of the subgradient of a nondifferentiable convex function <math>f</math> can be seen below:
==Algorithm Discussion==
[[File:Subgradient2.png|600px|center]]
'''Basics:'''
Where <math>g_1</math> is a subgradient at point <math>x_1</math> and <math>g_2</math> and <math>g_3</math> are subgradients at point <math>x_2</math>. Notice that when the function is differentiable, such as at point <math>x_1</math>, the subgradient, <math>g_1</math>, just becomes the gradient to the function. Other important factors of the subgradient to note are that the subgradient gives a linear global underestimator of <math>f</math> and if <math>f</math> is convex, then there is at least one subgradient at every point in its domain. The set of all subgradients at a certain point is called the subdifferential, and is written as <math>\partial f(x_0)</math> at point <math>x_0</math>.
 
Starting with a convex function, <math>f</math>, such that <math>f:\mathbb{R}^n \to \mathbb{R}</math>. The classic implementation of the sub-gradient method iterates:
 
<math>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}</math>
 
where <math>g(k)</math> denotes any subgradient of <math>f</math> at <math>x^{(k)}</math> and <math>x^{(k)}</math> is the <math>k^{th}</math> iteration of <math>x</math>. A subgradient of <math>f</math> is defined as:
 
<math>g^{(k)} \in \partial f(x^{(k)})</math>
 
In the case where <math>f</math> is differentiable then the subgradient reduces to <math>\nabla f</math>, recovering the gradient method. It is possible that <math>-g(k)</math>is not a descent direction of <math>f</math> at <math>x^{(k)}</math>. This means that the method requires a list of the lowest objective function values found thus far <math>f_{best}</math>:
 
<math>f^{(k)}_{\text{best}} = \min \{ f^{(k-1)}_{\text{best}}, f(x^{(k)}) \}</math>
 


==The Subgradient Method==
Suppose <math>f:\mathbb{R}^n \to \mathbb{R}</math> is a convex function with domain <math>\mathbb{R}^n</math>. To minimize <math>f</math> the subgradient method uses the iteration: <br/>
[[File:Submethod1.png|center]]
Where <math>k</math> is the number of iterations, <math>x^{(k)}</math> is the <math>k</math>th iterate, <math>g^{(x)}</math> is ''any'' subgradient at <math>x^{(k)}</math>, and <math>\alpha_k</math><math>(> 0)</math> is the <math>k</math>th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. As explained above, when <math>f</math> is differentiable, <math>g^{(k)}</math> simply reduces to <math>\nabla</math><math>f(x^{(k)})</math>. It is also important to note that the subgradient method is not a descent method in that the new iterate is not always the best iterate. Thus we need some way to keep track of the best solution found so far, ''i.e.'' the one with the smallest function value. We can do this by, after each step, setting <br/>
[[File:submethod2.png|200px|center]]
and setting <math>i_{\text{best}}^{(k)} = k</math> if <math>x^{(k)}</math> is the best (smallest) point found so far. Thus we have:
[[File:submethod3.png|237px|center]]
which gives the best objective value found in <math>k</math> iterations. Since this value is decreasing, it has a limit (which can be <math>-\infty</math>). <br/>
<br/>
An algorithm flowchart is provided below for the subgradient method: <br/>
An algorithm flowchart is provided below for the subgradient method: <br/>
[[File:SMFlowsheet.png|400px|center]]
[[File:SMFlowsheet.png|400px|center]]


===Step size===
'''Step Size Considerations:'''
Several different step size rules can be used:
 
*'''Constant step size''': <math>\alpha_k = h</math> independent of <math>k</math>.
The step size for this algorithm is determined externally to the algorithm itself. There are a number of methods for determining the step size:<br/>
*'''Constant step length''': [[File:stepsize1.png]] This means that [[File:stepsize2.png]]
*Constant step size:
*'''Square summable but not summable''': These step sizes satisfy
**<math>\alpha_k = \alpha</math>
:[[File:stepsize3.png]]  
 
:One typical example is [[File:stepsize4.png]] where <math>a>0</math> and <math>b\ge0</math>.
*Constant step length:
*'''Nonsummable diminishing''': These step sizes satisfy
**<math>\alpha_k = \frac{\gamma}{\| g^{(k)} \|_2} \quad </math> which gives <math>\quad \| x^{(k+1)} - x^{(k)} \|_2 = \gamma</math>
:[[File:stepsize5.png]]
 
:One typical example is [[File:stepsize6.png]] where <math>a>0</math>.
*Square Summable but not summable step size:
An important thing to note is that for all four of the rules given here, the step sizes are determined "off-line", or before the method is iterated. Thus the step sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step size rules used for descent methods for differentiable functions where the step sizes do depend on preceding iterations.
**<math>\alpha_k \geq 0, \quad \sum_{k=1}^\infty \alpha_k^2 < \infty, \quad \sum_{k=1}^\infty \alpha_k = \infty</math>
 
*Non Summable diminishing:
**<math>\alpha_k \geq 0, \quad \lim_{k \to \infty} \alpha_k = 0, \quad \sum_{k=1}^\infty \alpha_k = \infty</math>
 
*Non Summable diminishing step lengths:
**<math>\gamma_k \geq 0, \quad \lim_{k \to \infty} \gamma_k = 0, \quad \sum_{k=1}^\infty \gamma_k = \infty</math>
 
*Polyak’s Step length:
**<math>\alpha_k = \frac{f(x^{(k)}) - f^*}{\| g^{(k)} \|_2^2}</math>
 
'''Convergence:'''
 
In the case of constant step-length and scaled subgradients having Euclidean norms equal to one, the method converges within an arbitrary error, <math>\epsilon</math>, or:
<math>\quad \lim_{k \to \infty} f^{(k)}_{\text{best}}-f*<\epsilon</math>
 
This case however is slow and has poor performance. As such, it is largely used in specialized applications due to the simplicity and adaptability for specialized problem structures.
 
==Numerical Example==
We have a piecewise linear function:
 
<math> \begin{align} f(x) =  x < -2, -2x + 8\\
 
        -2 \leq x \leq 4, -x + 11\\
 
        4 \leq x \leq 7, x + 3\\
 
        7 < x, 3x + 9\end{align} </math>
 
 
[[File:SubGradientExampleChart.png|450px|thumb|This graph illustrates the piecewise function, f(x).]]
 
'''Table 1: Raw Data'''<br/>
[[File:SubGradientExampleTable.PNG]]
 
 
'''Step 1:''' Initial guess of <math>x</math> value and step size, <math>k</math>
*<math>x = -9</math> and <math>k = 0.1</math>
 
'''Step 2:''' Calculate <math>x^{(k+1)}</math>
*<math>-9 - 0.1 \cdot (-2) = -8.8</math>
 
'''Step 3:''' Evaluate <math>f(x^{(k+1)})</math>
*<math>-2 \cdot x^{(k+1)} + 8 = -2 \cdot -8.8 + 8 = 25.6</math>
 
'''Step 4:''' Store the <math>min(f_{best}, f(x^{(k+1)})</math>
*<math>min(-8.8, 25.6) = -8.8</math>
 
'''Step 5:''' Check that error against tolerance, iterate (return to Step 2) if error > <math>\epsilon</math>
*<math>25.6 - 7 = 18.6 > \epsilon</math> so we return to Step 2 using <math>-8.8</math> as <math>x_k</math>
*On iteration 155, <math>7 - 7 = 0 < \epsilon</math> so we have solved the problem
 
'''Step 6:''' Optimal solution is determined
 
==Applications==
Subgradient methods are generally for solving non-differentiable optimization problems. This algorithm is used in data science applications such as machine learning whenever the gradient method is not sufficient. It is also found in applications like engineering where it is utilized for problems in robotics and power systems (Licio, Romao).


===Convergence Results===
In some applications, the combination of the subgradient method and the primal dual decomposition can simplify to a distributed algorithm. This is shown in detail in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (Duchi et al.). The experiments outlined focus on different data sets such as text and images, and then use the subgradient method in order to flexibly be applied in various geometries. This adaptive characteristic provides unique benefits such as improved performance for identification of predictive attributes when compared to non-adaptive alternatives.
There are different results on convergence for the subgradient method depending on the different step size rules applied.
For constant step size rules and constant step length rules the subgradient method is guaranteed to converge within some range of the optimal value. Thus:
[[File:convergence1.png|center]]
where <math>f^{*}</math> is the optimal solution to the problem and <math>\epsilon</math> is the aforementioned range of convergence. This means that the subgradient method finds a point within <math>\epsilon</math> of the optimal solution <math>f^{*}</math>. <math>\epsilon</math> is number that is a function of the step size parameter <math>h</math>, and as <math>h</math> decreases the range of convergence <math>\epsilon</math> also decreases, ''i.e.'' the solution of the subgradient method gets closer to <math>f^{*}</math> with a smaller step size parameter <math>h</math>.
For the diminishing step size rule and the square summable but not summable rule, the algorithm is guaranteed to converge to the optimal value or [[File:convergence2.png]] When the function <math>f</math> is differentiable the subgradient method with constant step size yields convergence to the optimal value, provided the parameter <math>h</math> is small enough.


==Example: Piecewise linear minimization==
Some commercial tools like MATLAB and optimization solvers like Gurobi, FICO, and MOSEK contain the subgradient method algorithm. There are also open-source solvers like Couenee and GLPK that support this function. Alternatively, CVXPY is an open-source Python embedded modeling language that contains subgradient methods in its library.
Suppose we wanted to minimize the following piecewise linear convex function using the subgradient method: <br/>
[[File:Example1.png|center]]
Since this is a linear programming problem finding a subgradient is simple: given <math>x</math> we can find an index <math>j</math> for which:
[[File:Example2_so.png|center]]
The subgradient in this case is <math>g=a_j</math>. Thus the iterative update is then:
[[File:Example3_so.png|center]]
Where <math>j</math> is chosen such to satisfy [[File:Example4_so.png]]
In order to apply the subgradient method to this problem all that is needed is some way to calculate [[File:Example5.png]] and the ability to carry out the iterative update. Even if the problem is dense and very large (where standard linear programming might fail), if there is some efficient way to calculate <math>f</math> then the subgradient method is a reasonable choice for algorithm.
Consider a problem with <math>n=10</math> variables and <math>m=100</math> terms and with data <math>a_i</math> and <math>b_i</math> generated from a normal distribution. We will consider all four of the step size rules mentioned above and will plot <math>\epsilon</math> or the difference between the optimal solution and the subgradient solution as a function of <math>k</math>, the nuber of iterations. <br/>
For the constant step size rule [[File:Example6_so.png]] for several values of <math>h</math> the following plot was obtained: <br/>
[[File:Example7.png]] <br/>
For the constant step length rule [[File:Example8.png]] for several values of <math>h</math> the following plot was obtained: <br/>
[[File:Example9.png]] <br/>
The above figures reveal a trade-off: a larger step size parameter <math>h</math> gives a faster convergence but in the end gives a larger range of suboptimality so it is important to determine an <math>h</math> that will converge close to the optimal solution without taking a very large number of iterations. <br/>
For the subgradient method using diminishing step size rules, both the nonsummable diminishing step size rule [[File:Example10.png]] (blue) and the square summable but not summable step size rule [[File:Example11.png]] (red) are plotted below for convergence: <br/>
[[File:Example12.png]] <br/>
This figure illustrates that both the nonsummable diminishing step size rule and the square summable but not summable step size rule show relatively fast and good convergence. The square summable but not summable step size rule shows less variation than the nonsummable diminishing step size rule but both show similar speed and convergence. <br/>
Overall, all four step size rules can be used to get good convergence, so it is important to try different values for <math>h</math> in the constant step size and length rules and different formulas for the nonsummable diminishing step size rule and the square summable but not summable step size rule in order to get good convergence in the smallest amount of iterations possible.


==Conclusion==
==Conclusion==
The subgradient method is a very simple algorithm for minimizing convex nondifferentiable functions where newton's method and simple linear programming will not work. While the subgradient method has a disadvantage in that it can be much slower than interior-point methods such as Newton's method, it as the advantage of the memory requirement being often times much smaller than those of an interior-point or Newton method, which means it can be used for extremely large problems for which interior-point or Newton methods cannot be used. Morever, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem. The subgradient method is therefor an important method to know about for solving convex minimization problems that are nondifferentiable or very large.
In summary, the subgradient method is a simple algorithm for the optimization of non-differentiable functions. While its performance is not as desirable as other algorithms, its simplicity and adaptability to problem formulation keeps it in use for a number of applications. As always, problem formulation should be a key consideration in the selection of this algorithm. A number of variations on step size and solutions exist extending the applicability of this method and it should be considered for the case of non-differentiable problems.  


==References==
==References==
1. Akgul, M. "Topics in Relaxation and Ellipsoidal Methods", volume 97 of Research Notes in Mathematics. Pitman, 1984. <br/>
1. Boyd, Stephen. “Subgradient Methods.” web.stanford.edu, 2014, https://web.stanford.edu/class/ee364b/lectures/subgrad_method_notes.pdf <br/>
2. Bazaraa, M. S., Sherali, H. D. "On the choice of step size in subgradient optimization." European Journal of Operational Research 7.4, 1981 <br/>
2. Duchi, John, et al. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.Journal of Machine Learning Research, 11 7 2011, https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf <br/>
3. Bertsekas, D. P. "Nonlinear Programming", (2nd edition), Athena Scientific, Belmont, MA, 1999. <br/>
3. Sharma, Shashi. “Analysis on Sub-Gradient and Semi-Definite Optimization.” Journal of Advances and Scholarly Researches in Allied Education, vol. 12, no. 2, Jan. 2017 <br/>
4. Goffin, J. L. "On convergence rates of subgradient optimization methods." Mathematical Programming 13.1, 1977. <br/>
4. Romao, Licio, et al. “Subgradient Averaging for Multi-Agent Optimisation with Different Constraint Sets.” ScienceDirect, Pergamon, 2 June 2021, www.sciencedirect.com/science/article/abs/pii/S0005109821002582 <br/>
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985. <br/>
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985 <br/>
6. Shor, N. Z. "Nondifferentiable Optimization and Polynomial Problems". Nonconvex Optimization and its Applications. Kluwer, 1998. <br/>

Latest revision as of 13:38, 15 December 2024

Author: Malichi Merski (mm2835), Ryan Ortiz (rjo64), Nicholas Phillips (ntp28) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu

A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point x0.

The subgradient method is a simple algorithm for the optimization of non-differentiable functions, and it originated in the Soviet Union during the 1960s and 70s, primarily by the contributions of Naum Z. Shor (Sharma, Shashi). While the calculations for this approach are similar to that of the gradient method for differentiable functions, there are several key differences. First, as noted the subgradient method applies strictly to non-differentiable functions as it reduces to the gradient method when is differentiable. Secondly the step size is fixed before the application of the algorithm rather than being determined “on-line” as in other approaches. Finally the subgradient method is not a descent method as the value of f can and often will increase.

Introduction

The subgradient method is more computationally expensive when compared to Newton's method but applicable to a wider range of problems. Additionally, due to the method’s schema when applied numerically the memory requirements are smaller than other methods allowing larger problems to be approached. Further the combination of the subgradient method with the primal dual decomposition can simplify some applications to a distributed algorithm.

Algorithm Discussion

Basics:

Starting with a convex function, , such that . The classic implementation of the sub-gradient method iterates:

where denotes any subgradient of at and is the iteration of . A subgradient of is defined as:

In the case where is differentiable then the subgradient reduces to , recovering the gradient method. It is possible that is not a descent direction of at . This means that the method requires a list of the lowest objective function values found thus far :


An algorithm flowchart is provided below for the subgradient method:

Step Size Considerations:

The step size for this algorithm is determined externally to the algorithm itself. There are a number of methods for determining the step size:

  • Constant step size:
  • Constant step length:
    • which gives
  • Square Summable but not summable step size:
  • Non Summable diminishing:
  • Non Summable diminishing step lengths:
  • Polyak’s Step length:

Convergence:

In the case of constant step-length and scaled subgradients having Euclidean norms equal to one, the method converges within an arbitrary error, , or:

This case however is slow and has poor performance. As such, it is largely used in specialized applications due to the simplicity and adaptability for specialized problem structures.

Numerical Example

We have a piecewise linear function:


This graph illustrates the piecewise function, f(x).

Table 1: Raw Data


Step 1: Initial guess of value and step size,

  • and

Step 2: Calculate

Step 3: Evaluate

Step 4: Store the

Step 5: Check that error against tolerance, iterate (return to Step 2) if error >

  • so we return to Step 2 using as
  • On iteration 155, so we have solved the problem

Step 6: Optimal solution is determined

Applications

Subgradient methods are generally for solving non-differentiable optimization problems. This algorithm is used in data science applications such as machine learning whenever the gradient method is not sufficient. It is also found in applications like engineering where it is utilized for problems in robotics and power systems (Licio, Romao).

In some applications, the combination of the subgradient method and the primal dual decomposition can simplify to a distributed algorithm. This is shown in detail in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (Duchi et al.). The experiments outlined focus on different data sets such as text and images, and then use the subgradient method in order to flexibly be applied in various geometries. This adaptive characteristic provides unique benefits such as improved performance for identification of predictive attributes when compared to non-adaptive alternatives.

Some commercial tools like MATLAB and optimization solvers like Gurobi, FICO, and MOSEK contain the subgradient method algorithm. There are also open-source solvers like Couenee and GLPK that support this function. Alternatively, CVXPY is an open-source Python embedded modeling language that contains subgradient methods in its library.

Conclusion

In summary, the subgradient method is a simple algorithm for the optimization of non-differentiable functions. While its performance is not as desirable as other algorithms, its simplicity and adaptability to problem formulation keeps it in use for a number of applications. As always, problem formulation should be a key consideration in the selection of this algorithm. A number of variations on step size and solutions exist extending the applicability of this method and it should be considered for the case of non-differentiable problems.

References

1. Boyd, Stephen. “Subgradient Methods.” web.stanford.edu, 2014, https://web.stanford.edu/class/ee364b/lectures/subgrad_method_notes.pdf
2. Duchi, John, et al. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research, 11 7 2011, https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
3. Sharma, Shashi. “Analysis on Sub-Gradient and Semi-Definite Optimization.” Journal of Advances and Scholarly Researches in Allied Education, vol. 12, no. 2, Jan. 2017
4. Romao, Licio, et al. “Subgradient Averaging for Multi-Agent Optimisation with Different Constraint Sets.” ScienceDirect, Pergamon, 2 June 2021, www.sciencedirect.com/science/article/abs/pii/S0005109821002582
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985