Subgradient optimization

From Cornell University Computational Optimization Open Textbook - Optimization Wiki


[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''<sub>0</sub>.]]
The '''subgradient method''' is a simple algorithm for the optimization of non-differentiable convex functions. It originated in the Soviet Union during the 1960s and 70s, primarily through the contributions of Naum Z. Shor (Sharma, Shashi). While the calculations for this approach are similar to those of the gradient method for differentiable functions, there are several key differences. First, the subgradient method applies to non-differentiable functions, and it reduces to the gradient method when <math>f</math> is differentiable. Second, the step sizes are fixed before the algorithm is run rather than being determined "on-line" as in other approaches. Finally, the subgradient method is not a descent method: the value of <math>f</math> can, and often will, increase between iterations.


==Introduction==
The subgradient method is more computationally expensive than Newton's method but applicable to a wider range of problems. Additionally, because of the method's simple update scheme, its memory requirements are smaller than those of other methods, allowing larger problems to be approached. Further, the combination of the subgradient method with primal-dual decomposition can simplify some applications into a distributed algorithm.


==Algorithm Discussion==
'''Basics:'''


Start with a convex function <math>f:\mathbb{R}^n \to \mathbb{R}</math>. The classic implementation of the subgradient method iterates:


<math>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}</math>


where <math>g^{(k)}</math> denotes any subgradient of <math>f</math> at <math>x^{(k)}</math>, and <math>x^{(k)}</math> is the <math>k^{th}</math> iterate of <math>x</math>. A subgradient of <math>f</math> is defined as:


<math>g^{(k)} \in \partial f(x^{(k)})</math>


where <math>\partial f(x^{(k)})</math>, the subdifferential, is the set of all subgradients of <math>f</math> at <math>x^{(k)}</math>. In the case where <math>f</math> is differentiable, the subgradient reduces to <math>\nabla f</math>, recovering the gradient method. It is possible that <math>-g^{(k)}</math> is not a descent direction of <math>f</math> at <math>x^{(k)}</math>. This means that the method must keep track of the lowest objective function value found so far, <math>f_{\text{best}}</math>:


<math>f^{(k)}_{\text{best}} = \min \{ f^{(k-1)}_{\text{best}}, f(x^{(k)}) \}</math>




An algorithm flowchart is provided below for the subgradient method: <br/>
[[File:SMFlowsheet.png|400px|center]]


'''Step Size Considerations:'''


As stated in the introduction, the step size for this algorithm is determined externally to the algorithm itself, before the method is iterated. There are a number of rules for determining the step size:<br/>
*Constant step size:
**<math>\alpha_k = \alpha</math>
 
*Constant step length:
**<math>\alpha_k = \frac{\gamma}{\| g^{(k)} \|_2}, \quad </math> which gives <math>\quad \| x^{(k+1)} - x^{(k)} \|_2 = \gamma</math>
 
*Square summable but not summable step sizes:
**<math>\alpha_k \geq 0, \quad \sum_{k=1}^\infty \alpha_k^2 < \infty, \quad \sum_{k=1}^\infty \alpha_k = \infty</math>
 
*Nonsummable diminishing step sizes:
**<math>\alpha_k \geq 0, \quad \lim_{k \to \infty} \alpha_k = 0, \quad \sum_{k=1}^\infty \alpha_k = \infty</math>
 
*Nonsummable diminishing step lengths:
**<math>\gamma_k \geq 0, \quad \lim_{k \to \infty} \gamma_k = 0, \quad \sum_{k=1}^\infty \gamma_k = \infty</math>
 
*Polyak's step length (requires knowledge of the optimal value <math>f^*</math>):
**<math>\alpha_k = \frac{f(x^{(k)}) - f^*}{\| g^{(k)} \|_2^2}</math>
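The iteration, the best-value bookkeeping, and an "off-line" step size rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the article; the test function <math>f(x) = \|x\|_1</math>, its subgradient <math>\operatorname{sign}(x)</math>, and the diminishing rule <math>\alpha_k = 0.5/\sqrt{k}</math> are arbitrary choices for demonstration.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=1000):
    """Minimize a convex, possibly non-differentiable f.

    f       -- callable returning f(x)
    subgrad -- callable returning any subgradient g of f at x
    step    -- callable k -> alpha_k, fixed "off-line" before the run
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)
        x = x - step(k) * g            # x^(k+1) = x^(k) - alpha_k * g^(k)
        if f(x) < f_best:              # not a descent method: track the best
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# f(x) = ||x||_1 is convex and non-differentiable at 0; sign(x) is a
# valid subgradient everywhere (including 0, where sign gives 0).
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)

# Nonsummable diminishing step size: alpha_k = 0.5 / sqrt(k)
x_best, f_best = subgradient_method(f, subgrad, [3.0, -2.0],
                                    step=lambda k: 0.5 / np.sqrt(k))
```

With these choices the best objective value drifts toward the minimum <math>f^* = 0</math> even though individual iterates oscillate around it.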


'''Convergence:'''


In the case of constant step length with subgradients scaled to unit Euclidean norm, the method converges to within an arbitrary error, <math>\epsilon</math>, of the optimal value <math>f^*</math>:

<math>\lim_{k \to \infty} f^{(k)}_{\text{best}} - f^* < \epsilon</math>
 
This case, however, is slow and has poor performance. As such, it is largely used in specialized applications because of its simplicity and adaptability to specialized problem structures.

==Numerical Example==
We have a piecewise linear function:

<math> f(x) = \begin{cases} -2x + 8, & x < -2\\ -x + 11, & -2 \leq x \leq 4\\ x + 3, & 4 \leq x \leq 7\\ 3x + 9, & 7 < x \end{cases} </math>


[[File:SubGradientExampleChart.png|450px|thumb|This graph illustrates the piecewise function, f(x).]]

'''Table 1: Raw Data'''<br/>
[[File:SubGradientExampleTable.PNG]]

'''Step 1:''' Initial guess of <math>x</math> value and step size, <math>\alpha</math>
*<math>x = -9</math> and <math>\alpha = 0.1</math>

'''Step 2:''' Calculate <math>x^{(k+1)} = x^{(k)} - \alpha g^{(k)}</math>
*<math>-9 - 0.1 \cdot (-2) = -8.8</math>

'''Step 3:''' Evaluate <math>f(x^{(k+1)})</math>
*<math>-2 \cdot x^{(k+1)} + 8 = -2 \cdot (-8.8) + 8 = 25.6</math>

'''Step 4:''' Store <math>f_{\text{best}} = \min(f_{\text{best}}, f(x^{(k+1)}))</math>
*<math>f(x^{(0)}) = f(-9) = 26</math>, so <math>\min(26, 25.6) = 25.6</math>

'''Step 5:''' Check the error against the tolerance; iterate (return to Step 2) if error <math>> \epsilon</math>
*<math>25.6 - 7 = 18.6 > \epsilon</math>, so we return to Step 2 using <math>-8.8</math> as <math>x^{(k)}</math>
*On iteration 155, <math>7 - 7 = 0 < \epsilon</math>, so we have solved the problem

'''Step 6:''' Optimal solution is determined
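The worked example can be checked with a short script. This is a sketch under our own assumptions: the slope of the active piece is used as the subgradient (any valid choice works at the kinks), the constant step size is <math>\alpha = 0.1</math> as in Step 1, and the loop is simply capped at 300 iterations instead of testing a tolerance.

```python
def f(x):
    """The piecewise linear function from the example."""
    if x < -2:
        return -2 * x + 8
    elif x <= 4:
        return -x + 11
    elif x <= 7:
        return x + 3
    else:
        return 3 * x + 9

def subgrad(x):
    """Slope of the active piece (a valid subgradient away from the kinks)."""
    if x < -2:
        return -2.0
    elif x <= 4:
        return -1.0
    elif x <= 7:
        return 1.0
    else:
        return 3.0

alpha = 0.1                       # constant step size from Step 1
x = -9.0                          # initial guess from Step 1
x_best, f_best = x, f(x)          # f(-9) = 26
for k in range(300):
    x = x - alpha * subgrad(x)    # first step: -9 - 0.1 * (-2) = -8.8
    if f(x) < f_best:
        x_best, f_best = x, f(x)

# x_best approaches 4 and f_best approaches the optimal value f* = 7
```

Because the step size is constant, the iterates eventually oscillate in a band of width <math>\alpha</math> around the minimizer <math>x = 4</math>, so <math>f_{\text{best}}</math> settles within a small range of <math>f^* = 7</math> rather than converging exactly.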


==Applications==
Subgradient methods are generally used for solving non-differentiable optimization problems. The algorithm appears in data science applications such as machine learning whenever the gradient method is not sufficient. It is also found in engineering, where it is utilized for problems in robotics and power systems (Licio, Romao).


In some applications, the combination of the subgradient method and primal-dual decomposition can simplify to a distributed algorithm. This is shown in detail in ''Adaptive Subgradient Methods for Online Learning and Stochastic Optimization'' (Duchi et al.). The experiments outlined there focus on data sets such as text and images, and use the subgradient method so that it can be applied flexibly in various geometries. This adaptive characteristic provides unique benefits, such as improved identification of predictive attributes when compared to non-adaptive alternatives.


Some commercial tools like MATLAB and optimization solvers like Gurobi, FICO, and MOSEK contain the subgradient method algorithm. There are also open-source solvers like Couenne and GLPK that support this function. Alternatively, CVXPY is an open-source Python-embedded modeling language that contains subgradient methods in its library.


==Conclusion==
In summary, the subgradient method is a simple algorithm for the optimization of non-differentiable functions. While its performance is not as desirable as that of other algorithms, its simplicity and adaptability to problem formulation keep it in use for a number of applications. As always, problem formulation should be a key consideration in the selection of this algorithm. A number of variations on step size and solutions exist, extending the applicability of this method, and it should be considered for the case of non-differentiable problems.


==References==
1. Boyd, Stephen. “Subgradient Methods.” web.stanford.edu, 2014, https://web.stanford.edu/class/ee364b/lectures/subgrad_method_notes.pdf <br/>
2. Duchi, John, et al. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research, vol. 12, 2011, https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf <br/>
3. Sharma, Shashi. “Analysis on Sub-Gradient and Semi-Definite Optimization.” Journal of Advances and Scholarly Researches in Allied Education, vol. 12, no. 2, Jan. 2017. <br/>
4. Romao, Licio, et al. “Subgradient Averaging for Multi-Agent Optimisation with Different Constraint Sets.” ScienceDirect, Pergamon, 2 June 2021, www.sciencedirect.com/science/article/abs/pii/S0005109821002582 <br/>
5. Shor, N. Z. ''Minimization Methods for Non-differentiable Functions''. Springer Series in Computational Mathematics. Springer, 1985. <br/>

Latest revision as of 13:38, 15 December 2024

Author: Malichi Merski (mm2835), Ryan Ortiz (rjo64), Nicholas Phillips (ntp28) (ChemE 6800 Fall 2024)

Stewards: Nathan Preuss, Wei-Han Chen, Tianqi Xiao, Guoqing Hu
