Subgradient optimization

Latest revision as of 11:23, 1 April 2022

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Subgradient_optimization

Author Name: Aaron Anderson (ChE 345 Spring 2015)
Steward: Dajun Yue and Fengqi You

A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point x₀.

Subgradient Optimization (or Subgradient Method) is an iterative algorithm for minimizing convex functions, used predominantly in nondifferentiable optimization, i.e. for functions that are convex but nondifferentiable. It is often slower than Newton's method when applied to convex differentiable functions, but it can be used on convex nondifferentiable functions, where Newton's method will not converge. It was first developed by Naum Z. Shor in the Soviet Union in the 1960s.

Introduction

The Subgradient (related to Subderivative and Subdifferential) of a function is a way of generalizing or approximating the derivative of a convex function at nondifferentiable points. The definition of a subgradient is as follows: $g$ is a subgradient of $f$ at $x$ if, for all $y$, the following is true:

$$f(y)\geq f(x)+g^{T}(y-x)$$

An example of the subgradient of a nondifferentiable convex function $f$ can be seen below:

Here $g_{1}$ is a subgradient at point $x_{1}$, and $g_{2}$ and $g_{3}$ are subgradients at point $x_{2}$. Notice that when the function is differentiable, such as at point $x_{1}$, the subgradient $g_{1}$ is simply the gradient of the function. Two other important properties are that a subgradient gives a global linear underestimator of $f$ and that, if $f$ is convex, there is at least one subgradient at every point in its domain. The set of all subgradients at a point $x_{0}$ is called the subdifferential and is written as $\partial f(x_{0})$.
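As a small illustration of the definition, the following Python sketch checks the subgradient inequality numerically for $f(x)=|x|$, which is convex but nondifferentiable at $x=0$. The function `is_subgradient`, the choice of $f$, and the grid of test points are assumptions made for this example only; a check on a finite grid is evidence, not a proof.

```python
def f(x):
    """Convex function that is nondifferentiable at x = 0."""
    return abs(x)


def is_subgradient(g, x, test_points, tol=1e-12):
    """Check f(y) >= f(x) + g * (y - x) on a finite set of test points y."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in test_points)


# Hypothetical grid of test points in [-5, 5].
test_points = [i / 10 for i in range(-50, 51)]

# At the kink x = 0, every slope g in [-1, 1] satisfies the inequality.
kink_ok = [is_subgradient(g, 0.0, test_points) for g in (-1.0, -0.3, 0.0, 0.7, 1.0)]

# A slope outside [-1, 1] violates it for some y.
kink_bad = is_subgradient(1.5, 0.0, test_points)

# At the differentiable point x = 2, the derivative g = 1 passes
# and a different slope such as g = 0.5 fails.
smooth_ok = is_subgradient(1.0, 2.0, test_points)
smooth_bad = is_subgradient(0.5, 2.0, test_points)
```

In this sketch the subdifferential at $x=0$ is the interval $[-1,1]$, while at $x=2$ it contains only the gradient.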

The Subgradient Method

Suppose $f:\mathbb {R} ^{n}\to \mathbb {R}$ is a convex function with domain $\mathbb {R} ^{n}$. To minimize $f$, the subgradient method uses the iteration:

$$x^{(k+1)}=x^{(k)}-\alpha _{k}g^{(k)}$$

where $k$ is the iteration number, $x^{(k)}$ is the $k$th iterate, $g^{(k)}$ is any subgradient at $x^{(k)}$, and $\alpha _{k}>0$ is the $k$th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. As explained above, when $f$ is differentiable, $g^{(k)}$ simply reduces to $\nabla f(x^{(k)})$. It is also important to note that the subgradient method is not a descent method, in that the new iterate is not always the best iterate. Thus we need some way to keep track of the best solution found so far, i.e. the one with the smallest function value. We can do this by, after each step, setting

$$f_{\text{best}}^{(k)}=\min \left\{f_{\text{best}}^{(k-1)},f(x^{(k)})\right\}$$

and setting $i_{\text{best}}^{(k)}=k$ if $x^{(k)}$ is the best point found so far, i.e. the one with the smallest function value. Thus we have:

$$f_{\text{best}}^{(k)}=\min \left\{f(x^{(1)}),\ldots ,f(x^{(k)})\right\}$$

which gives the best objective value found in $k$ iterations. Since this value is nonincreasing in $k$, it has a limit (which can be $-\infty$).

An algorithm flowchart is provided below for the subgradient method:
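The iteration and the bookkeeping of the best point described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `subgradient_method`, its arguments, and the test function $f(x)=|x_{1}|+|x_{2}|$ are assumptions chosen for illustration.

```python
def subgradient_method(f, subgrad, x0, step_size, num_iters):
    """Minimize a convex function f, tracking the best iterate found.

    f         -- callable returning f(x) for a list of floats x
    subgrad   -- callable returning any subgradient of f at x
    step_size -- callable (k, g) -> alpha_k, the k-th step size
    """
    x = list(x0)
    x_best, f_best = list(x), f(x)
    history = [f_best]  # best objective value after each iteration
    for k in range(1, num_iters + 1):
        g = subgrad(x)
        alpha = step_size(k, g)
        # Step in the direction of a negative subgradient.
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        fx = f(x)
        # Not a descent method: keep the best point seen so far.
        if fx < f_best:
            x_best, f_best = list(x), fx
        history.append(f_best)
    return x_best, f_best, history


# Hypothetical test problem: f(x) = |x_1| + |x_2|, minimized at the origin.
def f(x):
    return sum(abs(xi) for xi in x)


def subgrad(x):
    # sign(x_i) is a valid subgradient component; 0 is valid when x_i = 0.
    return [(xi > 0) - (xi < 0) for xi in x]


x_best, f_best, history = subgradient_method(
    f, subgrad, x0=[3.0, -2.0], step_size=lambda k, g: 1.0 / k, num_iters=200
)
```

The list `history` is nonincreasing by construction, even though the objective values of the individual iterates may go up and down.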

Step size

Several different step size rules can be used:

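Constant step size: $\alpha _{k}=h$ independent of $k$.
Constant step length: $\alpha _{k}=h/\|g^{(k)}\|_{2}$. This means that $\|x^{(k+1)}-x^{(k)}\|_{2}=h$.
Square summable but not summable: These step sizes satisfy

$$\sum _{k=1}^{\infty }\alpha _{k}^{2}<\infty ,\qquad \sum _{k=1}^{\infty }\alpha _{k}=\infty$$

One typical example is $\alpha _{k}=a/(b+k)$, where $a>0$ and $b\geq 0$.

Nonsummable diminishing: These step sizes satisfy

$$\lim _{k\to \infty }\alpha _{k}=0,\qquad \sum _{k=1}^{\infty }\alpha _{k}=\infty$$

One typical example is $\alpha _{k}=a/{\sqrt {k}}$, where $a>0$.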

An important thing to note is that for all four of the rules given here, the step sizes are determined "off-line", or before the method is iterated. Thus the step sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step size rules used for descent methods for differentiable functions where the step sizes do depend on preceding iterations.
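The four rules can be written as small Python functions that return the step size for iteration $k$. This is a sketch under stated assumptions: the function names and the `(k, g)` calling convention are hypothetical, and the constant step length rule is taken to be $\alpha _{k}=h/\|g^{(k)}\|_{2}$, so it is the only rule here that looks at the current subgradient.

```python
import math


def constant_step_size(h):
    """alpha_k = h, independent of k."""
    return lambda k, g: h


def constant_step_length(h):
    """alpha_k = h / ||g||_2, so every step has Euclidean length h."""
    def rule(k, g):
        norm = math.sqrt(sum(gi * gi for gi in g))
        # A zero subgradient means the current point is already optimal,
        # so no step is taken in that case.
        return h / norm if norm > 0 else 0.0
    return rule


def square_summable(a, b=0.0):
    """alpha_k = a / (b + k) with a > 0 and b >= 0."""
    return lambda k, g: a / (b + k)


def nonsummable_diminishing(a):
    """alpha_k = a / sqrt(k) with a > 0."""
    return lambda k, g: a / math.sqrt(k)


# First five step sizes of the two diminishing rules (g is unused by them).
first_square_summable = [square_summable(1.0)(k, None) for k in range(1, 6)]
first_diminishing = [nonsummable_diminishing(1.0)(k, None) for k in range(1, 6)]
```

Each function returns a callable, so a rule can be chosen once and then evaluated at every iteration.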

Convergence Results

There are different results on convergence for the subgradient method depending on the step size rule applied. For the constant step size and constant step length rules, the subgradient method is guaranteed to converge to within some range of the optimal value. Thus:

$$\lim _{k\to \infty }f_{\text{best}}^{(k)}-f^{*}<\epsilon$$

where $f_{\text{best}}^{(k)}$ is the best objective value found in $k$ iterations, $f^{*}$ is the optimal value of the problem, and $\epsilon$ is the aforementioned range of convergence. This means that the subgradient method finds a point whose objective value is within $\epsilon$ of the optimal value $f^{*}$. $\epsilon$ is a number that is a function of the step size parameter $h$, and as $h$ decreases the range of convergence $\epsilon$ also decreases, i.e. the solution of the subgradient method gets closer to $f^{*}$ with a smaller step size parameter $h$.

For the diminishing step size rule and the square summable but not summable rule, the algorithm is guaranteed to converge to the optimal value, or $\lim _{k\to \infty }f_{\text{best}}^{(k)}=f^{*}$. When the function $f$ is differentiable, the subgradient method with constant step size yields convergence to the optimal value, provided the parameter $h$ is small enough.
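To make the dependence of $\epsilon$ on $h$ more concrete, a standard bound from the subgradient literature can be stated under two assumptions that are not part of the text above: the subgradients are bounded, $\|g^{(k)}\|_{2}\leq G$, and the starting point satisfies $\|x^{(1)}-x^{*}\|_{2}\leq R$ for some minimizer $x^{*}$. Writing $f_{\text{best}}^{(k)}$ for the best objective value found in $k$ iterations, these assumptions give:

$$f_{\text{best}}^{(k)}-f^{*}\leq {\frac {R^{2}+G^{2}\sum _{i=1}^{k}\alpha _{i}^{2}}{2\sum _{i=1}^{k}\alpha _{i}}}$$

For the constant step size $\alpha _{i}=h$, the right-hand side equals $(R^{2}+G^{2}h^{2}k)/(2hk)$, which tends to $G^{2}h/2$ as $k\to \infty$, so under these assumptions $\epsilon$ can be taken proportional to $h$. For step sizes that are square summable but not summable, the numerator stays bounded while the denominator grows without bound, so the right-hand side tends to zero.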

Example: Piecewise linear minimization

Suppose we wanted to minimize the following piecewise linear convex function using the subgradient method:

$$f(x)=\max _{i=1,\ldots ,m}\left(a_{i}^{T}x+b_{i}\right)$$

This problem can be posed as a linear program, and finding a subgradient is simple: given $x$, we can find an index $j$ for which:

$$a_{j}^{T}x+b_{j}=\max _{i=1,\ldots ,m}\left(a_{i}^{T}x+b_{i}\right)$$

The subgradient in this case is $g=a_{j}$. The iterative update is then:

$$x^{(k+1)}=x^{(k)}-\alpha _{k}a_{j}$$

where $j$ is chosen to satisfy $a_{j}^{T}x^{(k)}+b_{j}=\max _{i=1,\ldots ,m}\left(a_{i}^{T}x^{(k)}+b_{i}\right)$.

In order to apply the subgradient method to this problem, all that is needed is some way to calculate $\max _{i=1,\ldots ,m}\left(a_{i}^{T}x+b_{i}\right)$ and the ability to carry out the iterative update. Even if the problem is dense and very large (where standard linear programming might fail), if there is some efficient way to calculate $f$, then the subgradient method is a reasonable choice of algorithm.

Consider a problem with $n=10$ variables and $m=100$ terms, with data $a_{i}$ and $b_{i}$ generated from a normal distribution. We will consider all four of the step size rules mentioned above and will plot $\epsilon$, the difference between the best objective value found by the subgradient method and the optimal value, as a function of $k$, the number of iterations.
For the constant step size rule for several values of $h$ the following plot was obtained:

For the constant step length rule for several values of $h$ the following plot was obtained:

The above figures reveal a trade-off: a larger step size parameter $h$ gives faster convergence at first but a larger final range of suboptimality, so it is important to choose an $h$ that converges close to the optimal value without requiring a very large number of iterations.
For the subgradient method using diminishing step size rules, both the nonsummable diminishing step size rule (blue) and the square summable but not summable step size rule (red) are plotted below for convergence:

This figure illustrates that both the nonsummable diminishing step size rule and the square summable but not summable step size rule show relatively fast and good convergence. The square summable but not summable step size rule shows less variation than the nonsummable diminishing step size rule but both show similar speed and convergence.
Overall, all four step size rules can be used to get good convergence. It is therefore worth trying different values of $h$ in the constant step size and constant step length rules, and different formulas for the nonsummable diminishing and the square summable but not summable step size rules, in order to get good convergence in the fewest iterations possible.
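An experiment of this kind can be approximated with the standard-library Python sketch below. It is not the code behind the plots discussed above: the helper names (`make_problem`, `f_and_subgrad`, `run`), the random seed, the starting point $x=0$, the iteration count, and the step size parameters are all assumptions made for illustration. Because the optimal value $f^{*}$ would require solving the linear program, the sketch only records the best objective value found.

```python
import math
import random


def make_problem(n=10, m=100, seed=0):
    """Generate a_i in R^n and b_i from a standard normal distribution."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    b = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return A, b


def f_and_subgrad(A, b, x):
    """Return f(x) = max_i (a_i^T x + b_i) and the subgradient a_j."""
    values = [sum(aij * xj for aij, xj in zip(ai, x)) + bi for ai, bi in zip(A, b)]
    j = max(range(len(values)), key=values.__getitem__)
    return values[j], A[j]


def run(A, b, step_rule, num_iters=300):
    """Run the subgradient method from x = 0 and return the f_best history."""
    x = [0.0] * len(A[0])
    fx, g = f_and_subgrad(A, b, x)
    f_best = fx
    history = [f_best]
    for k in range(1, num_iters + 1):
        alpha = step_rule(k, g)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        fx, g = f_and_subgrad(A, b, x)
        f_best = min(f_best, fx)
        history.append(f_best)
    return history


def norm(g):
    return math.sqrt(sum(gi * gi for gi in g))


A, b = make_problem()

# Illustrative parameter choices, one per step size rule.
rules = {
    "constant step size, h = 0.01": lambda k, g: 0.01,
    "constant step length, h = 0.05": lambda k, g: 0.05 / norm(g),
    "square summable, 1/k": lambda k, g: 1.0 / k,
    "nonsummable diminishing, 0.1/sqrt(k)": lambda k, g: 0.1 / math.sqrt(k),
}

results = {name: run(A, b, rule) for name, rule in rules.items()}
for name, history in results.items():
    print(f"{name}: f_best after {len(history) - 1} iterations = {history[-1]:.4f}")
```

Changing `h` or the constants in the diminishing rules and rerunning the loop is a quick way to explore the trade-off between speed and final accuracy.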

Conclusion

The subgradient method is a very simple algorithm for minimizing convex nondifferentiable functions where Newton's method and simple linear programming will not work. While the subgradient method has the disadvantage that it can be much slower than interior-point methods or Newton's method, it has the advantage that its memory requirement is often much smaller than that of an interior-point or Newton method, which means it can be used for extremely large problems for which those methods cannot be used. Moreover, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem. The subgradient method is therefore an important method to know about for solving convex minimization problems that are nondifferentiable or very large.

