Cornell University Computational Optimization Open Textbook - Optimization Wiki - User contributions [en]

Sequential quadratic programming

2022-04-01T15:42:32Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Sequential_quadratic_programming

Authored by: Ben Goodman (ChE 345 Spring 2016) 
Steward: Dajun Yue and Fenqi You 
 

=Introduction=
Sequential quadratic programming (SQP) is a class of algorithms for solving non-linear optimization problems (NLP) in the real world. It is powerful enough for real problems because it can handle any degree of non-linearity including non-linearity in the constraints. The main disadvantage is that the method incorporates several derivatives, which likely need to be worked analytically in advance of iterating to a solution, so SQP becomes quite cumbersome for large problems with many variables or constraints. The method dates back to 1963 and was developed and refined in the 1970's .[1] SQP combines two fundamental algorithms for solving non-linear optimization problems: an active set method and Newton’s method, both of which are explained briefly below. Previous exposure to the component methods as well as to Lagrangian multipliers and Karush-Kuhn-Tucker (KKT) conditions is helpful in understanding SQP. The abstracted, general problem below will be used for the remainder of this page to explain and discuss SQP: 
<math> \text{min f(x)} </math> 
<math> \text{s.t. h(x) = 0} </math> 
<math> \text{and g(x)} \le 0 </math> 
 
with f(x), h(x), and g(x) each potentially non-linear. <math>x</math> is potentially a vector of many variables for the optimization, in which case h(x) and g(x) are systems. 
 

=Background: Prerequisite Methods=
==Karush-Kuhn-Tucker (KKT) Conditions and the Lagrangian Function==
The Lagrangian function combines all the information about the problem into one function using Lagrangian multipliers <math>\lambda</math> for equality constraints and <math>\mu</math> for inequality constraints:
<math>L(x,\lambda,\mu)</math><math>\text{ = f(x) +}</math><math>\sum_i</math><math>\lambda_i h_i(x)+</math><math>\sum_i</math><math>\mu_i g_i(x)</math> 
 
A single function can be optimized by finding critical points where the gradient is zero. This procedure now includes <math>\lambda</math> and <math>\mu</math> as variables (which are vectors for multi-constraint NLP). The system formed from this gradient is given the label KKT conditions: 
 
<math>\nabla L =</math><math>\begin{bmatrix} \frac{dL}{dx} \\ \frac{dL}{d\lambda} \\ \frac{dL}{d\mu} \end{bmatrix} =</math>
<math>\begin{bmatrix} \nabla f + \lambda \nabla h + \mu \nabla g^* \\ h \\ g^* \end{bmatrix} </math><math>=0</math> 
 
The second KKT condition is merely feasibility; h(x) were constrained to zero in the original NLP. The third KKT condition is a bit trickier in that only the set of active inequality constraints need satisfy this equality, the active set being denoted by <math>g^*</math>. Inequality constraints that are nowhere near the optimal solution are inconsequential, but constraints that actively participate in determining the optimal solution will be at their limit of zero, and thus the third KKT condition holds. Ultimately, the Lagrangian multipliers describe the change in the objective function with respect to a change in a constraint, so <math>\mu</math> is zero for inactive constraints, so those inactive constraints can be considered removed from the Lagrangian function before the gradient is even taken. 
 

==The Active Set Method and its Limitations==
The active set method solves the KKT conditions using guess and check to find critical points. Guessing that every inequality constraints is inactive is conventionally the first step. After solving the remaining system for <math>x</math>, feasibility can be checked. If any constraints are violated, they should be considered active in the next iteration, and if any multipliers are found to be negative, their constraints should be considered inactive in the next iteration. Efficient convergence and potentially large systems of equations are of some concern, but the main limitation of the active set method is that many of the derivative expressions in the KKT conditions could still be highly non-linear and thus difficult to solve. Indeed, only quadratic problems seem reasonable to tackle with the active set method because the KKT conditions are linear. Sequential Quadratic Programming addresses this key limitation by incorporating a means of handling highly non-linear functions: Newton's Method. 
 

==Newton's Method==
The main idea behind Newton's Method is to improve a guess in proportion to how quickly the function is changing at the guess and inversely proportional to how the function is accelerating at the guess. Walking through a few extreme scenarios makes this approach more intuitive: a long, steep incline in a function will not be close to a critical point, so the improvement should be large, and a shallow incline that is rapidly expiring is likely to be near a critical point, so the improvement should be small. The iterations converge to critical values of any function <math>f</math> with improvement steps that follow the form below: 
<math>x_{k+1} =</math> <math> x_k - </math> <math>\frac{\nabla f}{\nabla^2 f} </math> 
 
The negative sign is important. Near minimums, a positive gradient should decrease the guess and vice versa, and the divergence is positive. Near maximums, a positive gradient should increase the guess and vice versa, but the divergence is negative. This sign convention also prevents the algorithm from escaping a single convex or concave region; the improvement will reverse direction if it overshoots. This is an important consideration in non-convex problems with multiple local maximums and minimums. Newton's method will find the critical point closest to the original guess. Incorporating Newton's Method into the active set method will transform the iteration above into a matrix equation. 
 

=The SQP Algorithm=
Critical points of the objective function will also be critical points of the Lagrangian function and vice versa because the Lagrangian function is equal to the objective function at a KKT point; all constraints are either equal to zero or inactive. The algorithm is thus simply iterating Newton's method to find critical points of the Lagrangian function. Since the Lagrangian multipliers are additional variables, the iteration forms a system: 
<math> \begin{bmatrix} x_{k+1} \\ \lambda_{k+1} \\ \mu_{k+1} \end{bmatrix} =</math>
<math> \begin{bmatrix} x_{k} \\ \lambda_{k} \\ \mu_{k} \end{bmatrix} -</math><math> (\nabla^2 L_k)^{-1} \nabla L_k</math> 
 

Recall: <math>\nabla L =</math><math>\begin{bmatrix} \frac{dL}{dx} \\ \frac{dL}{d\lambda} \\ \frac{dL}{d\mu} \end{bmatrix} =</math>
<math>\begin{bmatrix} \nabla f + \lambda \nabla h + \mu \nabla g^* \\ h \\ g^* \end{bmatrix} </math> 
 
Then <math>\nabla^2 L = </math><math>\begin{bmatrix} \nabla_{xx}^2 L & \nabla h & \nabla g \\ \nabla h & 0 & 0 \\ \nabla g & 0 & 0 \end{bmatrix} </math> 
 
Unlike the active set method, the need to ever solve a system of non-linear equations has been entirely eliminated in SQP, no matter how non-linear the objective and constraints. Theoretically, If the derivative expressions above can be formulated analytically then coded, software could iterate very quickly because the system doesn't change. In practice, however, it is likely that the divergence will not be an invertible matrix because variables are likely to be linearly bound from above and below. The improvement direction "p" for the Newton's Method iterations is thus typically found in a more indirect fashion: with a quadratic minimization sub-problem that is solved using quadratic algorithms. The subproblem is derived as follows: 
 
<math> p = \frac{\nabla L}{\nabla^2 L} = \frac{(\nabla L)p}{(\nabla^2 L)p}</math> 
 
Since p is an incremental change to the objective function, this equation then resembles a two-term Taylor Series for the derivative of the objective function, which shows that a Taylor expansion with the increment p as a variable is equivalent to a Newton iteration. Decomposing the different equations within this system and cutting the second order term in half to match Taylor Series concepts, a minimization sub-problem can be obtained. This problem is quadratic and thus must be solved with non-linear methods, which once again introduces the need to solve a non-linear problem into the algorithm, but this predictable sub-problem with one variable is much easier to tackle than the parent problem. 
<math>\text{min(p) } f_k(x) + \nabla f_k^T p + \frac{1}{2}p^T\nabla_{xx}^2L_k p</math> 
<math>\text{s.t. } \nabla h_k p + h_k = 0</math> 
<math>\text{ and } \nabla g_k p + g_k = 0</math> 
 

==Example Problem==
[[File:Wiki_Inspection.jpg|frame|Figure 1: Solution of Example Problem by Inspection.10]]

<math> \text{ Max Z = sin(u)cos(v) + cos(u)sin(v)}</math> 
<math> \text{ s.t. } 0 \le \text{(u+v)} \le \pi</math> 
<math> \text{ } u = v^3 </math> 
 

This example problem was chosen for being highly non linear but also easy to solve by inspection as a reference. The objective function Z is a trigonometric identity: 
<math> \text{ sin(u)cos(v) + cos(u)sin(v) = sin(u+v)}</math> 
 
The first constraint then just restricts the feasible zone to the first half of a period of the sine function, making the problem convex. The maximum of the sine function within this region occurs at <math>\frac{\pi}{2}</math>, as shown in Figure 1. The last constraint then makes the problem easy to solve algebraically: 
<math> u + v = \frac{\pi}{2}</math> 
<math> u = v^3 </math> 
<math>v^3 + v = \frac{\pi}{2}</math> 
<math>v = 0.883</math> and <math>u = v^3 = 0.688</math> 
 
Now, the problem will be solved using the sequential quadratic programming algorithm. The Lagrangian funtion with its gradient and divergence are as follows: 
<math> L = sin(u)cos(v) + cos(u)sin(v) + \lambda_1 (u - v^3) + \mu_1 ( -u - v) + \mu_2 (u + v - \pi)</math> 
 

<math>\nabla L =</math><math>\begin{bmatrix} \frac{dL}{du} \\ \frac{dL}{dv} \\ \frac{dL}{d\lambda_1} \\ \frac{dL}{d\mu_1} \\ \frac{dL}{d\mu_2} \end{bmatrix} =</math>
<math>\begin{bmatrix} cos(v)cos(u) - sin(u)sin(v) + \lambda_1 - \mu_1 + \mu_2 \\ cos(v)cos(u) - sin(u)sin(v) - 3\lambda_1v^2 - \mu_1 + \mu_2 \\ u - v^3 \\ -u - v \\ u + v - \pi \end{bmatrix} </math> 
 

<math>\nabla^2 L = </math><math>\begin{bmatrix} -Z & -Z & 1 & -1 & 1 \\ -Z & -(Z + 6\lambda_1v) & -3v^2 & -1 & 1 \\ 1 & -3v^2 & 0 & 0 & 0 \\ -1 & -1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \end{bmatrix} </math>
with <math>Z = sin(u)cos(v) + cos(u)sin(v)</math> 
 

[[File:Code_for_Wiki2.JPG|frame|Figure 2: MATLAB program for performing sequential Newton steps on quadratic subproblem.10]]

The first important limitation in using SQP is now apparent: the divergence matrix is not invertible because it is not full rank. We will switch to the quadratic minimization sub-problem above, but even with this alternate framework, the gradient of the constraints must be full rank. This can be handled for now by artificially constraining the problem a bit further so that the derivatives of the inequality constraints are not linearly dependent. This can be accomplished through a small modification to the <math>\mu_2</math> constraint. 
If <math> u + v \le \pi</math>, then it is also true that <math> u + v + exp(-7u) \le \pi</math> 
 
The addition to the left-hand-side is relatively close to zero for the range of possible values of <math>u</math>, so the feasible region has not changed much. This complication is certainly annoying, however, for a problem that's easily solved by inspection. It illustrates that SQP is truly best for problems with highly non-linear objectives and constraints. Now, the problem is ready to be solved. The MATLAB code in figure two was implemented, using the function fmincon to solve the minimization subproblems. fmincon is itself an SQP piece of software. In each step, the incumbent guess is plugged into the gradient, hessian, and constraint arrays, which then become parameters for the minimization problem. 
 

=Conclusion=
SQP is powerful enough to be used in commercial software but also burdened by some intricacy. In addition to the complication from needing full-rank constraint gradients, the divergence matrix can be very difficult or laborious to assemble analytically. Commercial SQP packages include checks for the feasibility of the sub-problem in order to account for rank deficiencies. In addition to fmincon, SNOPT and FILTERSQP are two other commercial SQP packages, and each uses a different non-linear method to solve the quadratic subproblem.[1] [https://optimization.mccormick.northwestern.edu/index.php/Line_search_methods Line search methods] and [https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods trust-region methods] are trusted options for this step, and sub-gradient methods have also been proposed. The other common modification to SQP (ubiquitous to commercial packages) is to invoke [https://optimization.mccormick.northwestern.edu/index.php/Quasi-Newton_methods quasi-Newton methods] in order to avoid computing the Hessian entirely. SQP is thus very much a family of algorithms rather than a stand-alone tool for optimization. At its core, it is a method for turning large, very non-linear problems into a sequence of small quadratic problems to reduce the computational expense of the problem.

=Sources=
[1] Nocedal, J. and Wright, S. Numerical Optimization, 2nd. ed., Ch. 18. Springer, 2006. 
[2] You, Fengqi. Lecture Notes, Chemical Engineering 345 Optimization. Northwestern University, 2015.

File:Code for Wiki2.JPG

2022-04-01T15:42:14Z

Btantisujjatham:

Branch and cut for MINLP

2022-04-01T15:41:12Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Branch_and_cut_for_MINLP

Author: Pear Dhiantravan (ChE 345 Spring 2015)

Stewards: Dajun Yue, Fengqi You

Branch and Cut is a method for solving Mixed Integer Non Linear Programming (MINLP) models. There are variations on the exact steps of the algorithm, but the original method developed by Ioannis Akrotirianakis, Istvan Maros, and Berc Rustem in 2000 utilizes the general organization of the Branch and Bound method with the iterative nature of the Outer Approximation.1 The addition of Gomory mixed integer cuts as an alternative to branching improves the efficiency of the overall branch and bound algorithm by reducing the number of nodes and Non-Linear Programming (NLP) problems required to solve the MINLP.

=Introduction=
The organization of general design problems into programming models allows for the defining and finding of their (global) optimal solution. MINLP models represent problems as a sets of continuous variables with binary integer variables. The continuous variables are restricted to defined constraints, and the binary variables represent whether or not a design choice is made.

===History===
[[File:B&C_minlp.JPG|500px|thumb|Branch and Bound with Gomory Cuts and the iterative nature of Outer Approximations 2]]

Mixed integer programming problems were explored most generally in the later half of the 1900s. Ignacio Grossmann published many papers on mixed integer linear programing problems and their solutions and applications in the late 1980s through the 1990s. In those decades, research was focused largely on improving the [https://optimization.mccormick.northwestern.edu/index.php/Branch_and_bound_%28BB%29 Branch and Bound] method to achieve greater time efficiency. The branch and bound method was developed by A.H. Land and A.G. Doig in 1960 to solve linear integer problems,3,1 and expanded in 1965 by R.J. Dakin to incorporate nonlinear integer problem solutions. 4,1 Branch and bound seemed to be a robust method, however the time required to solve increases exponentially as the number of variables increases.5 In the 1950s and 60s, Ralph Gomory developed a different method using cutting planes to solve integer problems. Gomory Cuts were quicker to solve but gave less reliable solutions. The exploration of incorporating Gomory cuts into the branch and bound method gained popularity when solutions showed greater efficacy upon combination of these methods. The joint algorithm is called Branch and Cut.

In the early 2000s, Akrotirianakis, Maros, and Rustem built on the findings of their predecessors to expand the application of the branch and cut method to non linear programs. More research surfaced on broader applications of this method, and with this grew the development of computational programs such as CPLEX and XPRESS-MP to run and solve these problems using given iterative solution methods.6

===Applications===
Mixed integer non linear programming models can cover a wide variety of engineering and design problems with discrete variable values.

The Branch and Cut method is well-known for its long-time use in solving the [https://optimization.mccormick.northwestern.edu/index.php/Traveling_salesman_problems Traveling Salesman Problem].6 In 1991, Padberg and Rinaldi developed a solution to large-scale symmetric traveling salesman problems using the branch and cut method.7 MINLP problems involving scheduling, network design (including facility locating), and a variety of simplified biological models can be solved using the branch and cut algorithm given integer-valued variables.6

=Formulation=
Given that <math>f(x)</math>, <math>g(x)</math>, and <math>h(x)</math> are convex and continuously differentiable, a general MINLP problem solvable by the Branch and Cut method can be formulated as follows:

<math>min Z(x,y)</math> <math>=</math> <math>C^T y</math> <math>+ f(x)</math>

<math>s.t.</math> <math>g(x) + By \leqslant 0</math>

<math> h(x) = 0</math>

<math> x \in X, y \in {0,1}</math>

=Algorithm=
The Branch and Cut algorithm closely follows that of the Outer Approximation for MINLP, but incorporates Gomory cuts in order to decrease the search space of the problem.

The algorithm presented here is the method developed by Akrotirianakis, Maros, and Rustem [1a]. Variations on this method are described below.

===Initialization: Value of the Search Tree Root===
0. Upper bound = inf, Lower bound = -inf

1. Initialize <math>y^0</math>
The Branch and Cut method begins with an initialization of the binary variables <math>(y)</math> by defining each with a finite starting value 0 or 1.
Otherwise, the initialization may be achieved by solving a relaxed NLP setting <math>0\leqslant y \leqslant 1</math>

2. Solve NLP using <math>y^0</math>
'''if''' NLP is feasible
→ solve <math>Z(x^0, y^0)</math>. This yields a new '''upper bound''' to the original problem. Move to step 3
'''else''' NLP is not feasible
→ if an initial set of values for <math>y^0</math> was chosen, choose a new <math>(y^0)</math>.
Repeat 1 and 2 until a feasible solution is obtained. Move to step 3.

The NLP should result in an initial feasible solution <math>x^0</math> and <math>y^0</math> in which <math>Z(x^0, y^0)</math> <math>\leq \infty</math>.

Otherwise, if the relaxed NLP is infeasible, then the MINLP problem is infeasible.

3. Solve MILP

Using <math>x^0</math> and <math>y^0</math> obtained from the NLP, solve the linearized objective and constraints according to the following equations:

<math>min</math> <math>a</math>

<math>s.t.</math> <math>a \leqslant UB^0 </math>

<math>f(x^0,y^0) +</math> <math>\nabla\ f(x^0,y^0)^T</math> <math>\binom{x-x^0}{y-y^0} </math> <math>\leqslant</math> <math>a</math>

<math>g(x^0,y^0) +</math> <math>\nabla\ g(x^0,y^0)^T</math> <math>\binom{x-x^0}{y-y^0} </math> <math>\leqslant</math> <math>0</math>

<math>y \in</math> {0,1}

The solution to the MILP will yield a new set of <math>(x)</math> and <math>(y)</math>. Call these the new <math>(x^0)</math> and <math>(y^0)</math>; this is the '''root''' of the branch and bound search tree.

===Solve Nodes===
4. Select an unsolved node
'''If''' no nodes are left,
'''then''' the problem is solved and <math>Z(x^j, y^j)</math> is the optimal solution.

5. Solve MILP (formulation below)
'''If''' all <math>y^k</math> = {0,1}
This node is the new incumbent
'''and''' Z obtained from this set of <math>x^k</math> and <math>y^k</math> is the current optimal solution.
'''and''' add linearizations of the objective and constraints
using <math>x^k</math> and <math>y^k</math> to each unsolved node according to the equation below.
Go to step 6.
'''Else''' a <math>y^k</math> ∉ {0,1}
Go to step 7.

Solve the corresponding MILP of the problem using the previous feasible solution <math>(x^j, y^j)</math> in the following form:

For the first node (excluding the root), <math>(x^j, y^j)</math> will be <math>(x^0, y^0)</math>, etc.

<math>min</math> <math>a</math>

<math>s.t.</math> <math>a \leqslant UB^j </math>

<math>f(x^j,y^j) +</math> <math>\nabla\ f(x^j,y^j)^T</math> <math>\binom{x-x^j}{y-y^j} </math> <math>\leqslant</math> <math>a</math>

<math>g(x^j,y^j) +</math> <math>\nabla\ g(x^j,y^j)^T</math> <math>\binom{x-x^j}{y-y^j} </math> <math>\leqslant</math> <math>0</math> <math>\forall j \in T</math>

where <math>T</math> is the set containing all indices of the solution tree in which the NLP for <math>y^j</math> is feasible.

<math>g(x^i,y^i) +</math> <math>\nabla\ g(x^i,y^i)^T</math> <math>\binom{x-x^i}{y-y^i} </math> <math>\leqslant</math> <math>0</math> <math>\forall i \in S</math>

where <math>S</math> is the set containing all indices of the solution tree in which the NLP for <math>y^i</math> is infeasible.

<math>\Gamma x + \Gamma y \leqslant \epsilon</math>

This constraint represents the Gomory cuts generated as the algorithm runs. This line decreases the feasible region of the original problem.

<math>y \in</math> {0,1}

The solution to this MILP will yield a new set <math>(x^k)</math> and <math>(y^k)</math>.

6. Solve relaxed NLP
Using the MILP solution <math>y^k</math>
This should result in a new solution <math>x^j</math> and <math>y^j = y^k</math>.
'''If''' NLP<math>(y^j)</math> is feasible
→ solve <math>Z(x^j, y^j)</math>.
'''If''' <math>Z(x^j, y^j)</math> is less than the previous upper bound
'''then''' this yields a new '''upper bound''' <math>UB^j</math> to the original problem.
Fathom all unsolved nodes that do not satisfy this bound.
Return to step 4
'''Else''' retain the old upper bound.
Return to step 4
'''Else''' NLP<math>(y^j)</math> is not feasible
→ Fathom node.
Return to step 4

===Cutting vs Branching===
When the solution to the linear relaxation in Step 5 contains fractional values on binary variables <math>y</math>, either a cutting plane should be generated to exclude that solution or branching can be done at that node to constrain the fractional variable to its binary values 0 or 1.

7. Calculate "skip factor"
'''If''' a cut is to be performed, go to step 8.
'''Else''' the cut is to be skipped, go to step 9.

The decision whether or not to perform a Gomory cut is based on a "skip factor" <math>S</math> which is calculated throughout the enumeration of the tree.

<math>S =</math> <math>min[S_{max}</math>, <math>\dfrac{t + wcd \log_{10} p}{t f}]</math>

where t = number of integer nodes solved

S_{max}, c, w = positive constant parameter

p = number of binary variables y

f = number of violated binary variables in <math>y^k</math>

===Gomory Cuts===
8. Generate an inequality to exclude the fractional variable solution

<<Equation 3. for gomory mixed integer cuts>>

Add these cuts to the constraints of the original problem.

Return to step 5

===Branching===
9. Constrain the/a variable in <math>y^k</math> to branch off of.

One branch will set <math>y^k = 0</math> and the other <math>y^k = 1</math>, creating two new nodes. Return to step 5, following one of these nodes.

=Variations=
A variation on the Branch and Cut method presented by Sven Leyffer in 2013 utilizes cuts for every fractional solution and branches only when solutions are still fractional after multiple cuts. 8 Thus, the algorithm is the same until step 7, where cuts are always chosen to eliminate solutions where <math>y^k</math> ∉ {0,1}. The inequality generated to produce this cut is added to the constraints of the original problem.

=Conclusion=
The general Branch and Cut algorithm follows a scheme which grows exponentially as variables are added to the original design problem. The addition of Gomory Cuts allows bounds to be tightened, reducing the number of feasible nodes to be tested and allowing the solver to converge on the optimal solution with less iterations. However, the formation of inequalities to generate these cuts take time and may slow down the actual resolution of the MINLP problem. The larger the problem, the more useful the cuts can be.

=References=

# I. Akrotirianakis, I. Maros, and B. Rustem. An Outer Approximation Based Branch and Cut Algorithm for convex 0-1 MINLP Problems. ''Optimization Methods and Software.'' 21:47, 2001 https://www.doc.ic.ac.uk/research/technicalreports/2000/DTR00-6.pdf
# S. Leyffer and J. Linderoth. Introduction to Integer Nonlinear Optimization, Nonlinear Branch-and-Cut, Theoretical and Computational Challenges. ''Argonne National Laboratory.'' 2007. http://science.energy.gov/~/media/ascr/pdf/workshops-conferences/mathtalks/Leyffer.pdf
# A. H. Land and A. G. Doig. An Automatic method of solving discrete programming problems. ''Econometrica,'' 28:497 520, 1960.
# R. J. Dakin. A tree search algorithm for mixed integer probramming problems. ''Computer Journal,'' 8:250 255, 1965.
# S. Albert. Solving Mixed Integer Linear Programs Using Branch and Cut Algorithm. ''Masters of Mathematics, North Carolina State University.'' 2006. http://www4.ncsu.edu/~kksivara/masters-thesis/shon-thesis.pdf (good for milp)
# J. E. Mitchell. Branch-and-Cut Algorithms for Combinatorial Optimization Problems. ''Mathematical Sciences,'' 1999. http://homepages.rpi.edu/~mitchj/papers/bc_hao.pdf
# M. Padberg and G. Rinaldi. A Branch-and-Cut Algorithm for the Resolution of Large-Scale Symmetric Traveling Salesman Problems. ''SIAM Rev,'' 33(1), 60-100, 1991. http://epubs.siam.org/doi/abs/10.1137/1033004?journalCode=siread
# S. Leyffer. Mixed-Integer Nonlinear Optimization: Applications, Algorithms, and Computation III. ''Graduate School, Universite catholique de Louvain.'' 2013. https://wiki.mcs.anl.gov/leyffer/images/4/42/Socn-3.pdf

Generalized Benders decomposition (GBD)

2022-04-01T15:38:04Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Generalized_Benders_decomposition_(GBD)

Authors: Yuanxi Zhao (ChE 345 Spring 2015) 
Steward: Dajun Yue, Fengqi You 
Date Presented: May 22, 2015 

Generalized Benders Decomposition is a procedure to solve certain types of NLP and MINLP problems. The use of this procedure has been recently suggested as a tool for solving process design problems. While analyzing the solution of nonconvex problems through different implementations of the GBD, it is demonstrated that in certain cases only local minima may be found, whereas in other cases not even convergence to local optima can be achieved. 

=Introduction=
J.F. Benders devised an approach for exploiting the structure of mathematical programming problems with complicating variables (variables which, when temporarily fixed, render the remaining optimization problem considerably more tractable).The algorithm he proposed for finding the optimal value of this vector employs a cutting-plane approach for building up adequate representations of (i) the extremal value of the linear program as a function of the parameterizing vector and (ii) the set of values of the parameterizing vector for which the linear program is feasible. Linear programming duality theory was employed to derive the natural families of cuts characterizing these representations, and the parameterized linear program itself is used to generate what are usually deepest cuts for building up the representations. 

Geoffrion (1972) generalized the approach proposed by Benders (1962) to a broader class of optimization problems in which the parametrized subproblem need no longer be a linear program. Nonlinear convex duality theory is employed to derive the natural families of cuts corresponding to those in Bender's case.

=Model Formulation=
===Target problem===

<math>min_{x,y}</math><math>f(x,y)</math> 
<math>s.t.</math> <math>g(x) \leq 0 </math> <math>(1)</math> 
<math>x\in X</math>,<math>y\in Y</math> 

under the following assumptions: 
# <math> X \subset R</math> and <math>Y \subset R</math> are compact. 
# For all <math> y\in Y</math>,<math>f</math> and <math>g</math> are convex on <math>X</math>. 
# For <math>y</math> fixed to any feasible <math> y\in Y</math>, the problem satisfies Slater's condition. 

The following problem is equivalent to above and is called its projection on <math>Y</math> 

<math>min_y</math><math> v(y)</math> 
<math>s.t. </math> <math>v(y)=</math><math>min_{x \in X}f(x,y)</math> <math>s.t.</math><math> g(x,y) \leq 0</math> <math> (2)</math> 
<math>y\in Y\cap V</math> 

where <math>V</math><math> =</math>{<math>y: g(x,y)\leq 0 </math> for some <math>x</math><math>\in</math><math>X</math>}. 
For problems where <math>X</math> is a convex set, and the functions <math>f(x,y)</math>,<math>g(x,y)</math> are convex with respect to the variable <math>x</math>, Geooffrion (1972) proposed a decomposition of (2) based on the following two problems: 

===Primal problem===
<math>min_x</math><math>f(x,\bar{y})</math> 
<math>s.t. </math> <math>g(x,\bar{y})\leq 0</math> <math>(3)</math> 
<math>x\in X</math> 

where <math>\bar{y}</math> is an arbitrary but fixed point in <math>Y</math>.

===Master problem===
<math>min_{y\in Y}</math><math>[</math><math>max</math><math>_{u\geq 0}</math><math>[min_{x\in X}</math>{<math>f(x,y)</math><math>+u^T</math><math>g(x,y)</math>}<math>]]</math> 
<math>s.t.</math> <math>min_{\lambda}</math>{<math>{\lambda}^T</math><math>g(x,y)</math>}<math>\leq 0</math>, <math>\forall</math><math>{\lambda}</math><math>\in</math><math>{\Lambda}</math> <math> (4)</math>

where <math>{\Lambda}=</math>{<math>{\lambda}\geq 0;</math><math>\sum_i</math><math>{\lambda}</math><math>_i=1</math>} <math>and</math> <math>y>\in Y</math>, <math>x\in X</math>, <math>u\geq 0</math>

===Some Remark===

Before any further analysis, it is appropriate to point out that the master problem is equivalent to the projection (2), only when <math>X</math> is a convex set, and the functions <math>f(x,y),g(x,y)</math>are convex with respect to <math>x</math>. This is so because the dual of <math>v(y)</math>, as defined in (2), was invoked in arriving at the Master problem. Therefore, when <math>X</math> is not a convex set and/or convexity of the functions <math>f(x,y)</math> and <math>g(x,y)</math> in the variable <math>x</math>does not hold, a dual gap may exist between <math>v(y)</math> and its dual. In these cases, since by the weak duality theorem, the solution of <math>v(y)</math> is always greater than or equal to the solution of its dual, the master problem can only provide a lower bound on the optimum, i.e. 

<math>v(y)=</math><math>min_{x\in X}</math><math> f(x,y)</math><math> s.t.</math><math>g(x,y)\leq 0</math> 
<math>v(y)</math><math>\geq</math><math> max</math><math>_{u\geq 0}</math><math>[min_{x\in X}</math>{<math>f(x,y)</math><math>+ u^T g(x,y)</math>}<math>]</math> 
<math>\Rightarrow</math> 
<math>min_{y\in Y\cap V}</math><math>v(y)\geq</math><math>min_{y\in Y\cap V}</math><math>[max</math><math>_{u\geq 0}</math><math>[min_{x\in X}</math><math>f(x,y)</math><math>+u^T g(x,y)</math>}<math>]]</math> 
<math>min_{y\in Y\cap V}</math><math>v(y)</math><math>=min_{y\in Y}</math><math>[max</math><math>_{u\geq 0}</math><math>[min_{x\in X}</math><math>f(x,y)</math><math>+u^T g(x,y)</math>}<math>]]</math> 
<math>s.t.</math> <math>min_{x\in X}</math>{<math>{\lambda}^T g(x,y)</math>}<math>\leq 0</math>, <math>\forall</math><math>{\lambda}</math><math>\in</math><math>{\Lambda}</math> 

=Solution approaches=
===Algorithm flowchart===
[[File:Algorithm.png|800px|Algorithm flowchart]]
===Computational implementation===
The master problem can be rewritten as 

<math>min_{y_0=y} y_0</math> 
<math>s.t.</math> <math>L^*(y,u)</math><math>=min_{x\in X}</math>{<math>f(x,y)</math> <math>+u^Tg(x,y)</math>}<math>\leq y_0</math>, <math>all</math><math>u\geq 0</math>
<math>L_*(y,{\lambda})</math><math>=min_{x\in X}</math>{<math>{\lambda}^Tg(x,y)</math>}<math>\leq 0</math>, <math>all</math><math>{\lambda}\in {\Lambda}</math> 

Geoffrion (1972) suggests to solve a relaxed version of this problem in which all but a few constraints are ignored. Since constraints are continuously added to the master, the optimal values of this problem form a monotone nonedecreasing sequence. When <math>f(x,y)</math> and <math>g(x,y)</math> are convex in <math>x</math> and certain conditions hold, the lowest solution of the primal and the global solution of the master [as defined from above] will approach one another and will thus provide the global optimum of the overall problem within a prespecified tolerance. As analyzed earlier, when convexity in <math>x</math> does not hold, dual gaps may prevent these two values from approaching each other. Nevertheless the global solution of the relaxed master will still provide a valid lower bound on the global optimum of the overall problem. The resulting computational procedure is the following: 

'''Step 1.''' Let a point <math>\bar y</math> in <math>Y\cap V</math> be known. Solve the primal problem and obtain the optimal solution <math>x^*</math>, and the optimal multiplier vector <math>u^*</math>. Put the counters <math>K^f=1</math>,<math>K^i=0</math>. Set <math>U=f(x^*,\bar y)</math>, select a tolerance <math>\varepsilon > 0</math> and put <math>u^{(1)}=u^*</math>. Finally, determine the function <math>L^*(y,u^{(1)})</math>. 

'''Step 2.''' Solve globally the current relaxed master problem: 

<math>min_{y_0=y}</math><math>y_0</math> 
<math>s.t.</math> <math>L^*</math><math>(y,</math><math>u^{(k_1)})</math><math>\leq y_0</math>, <math>k_1=1,\dots</math><math>K^f</math> 
<math>L_*</math><math>(y,</math><math>{\lambda}^{(k_2)})</math><math>\leq 0</math>, <math>k_2=1,\dots</math><math>K^i</math> 

'''Step 3.''' Solve globally the primal problem using <math>\bar {y}=\hat {y}</math>. 
'''Step 3a.''' ''Primal is feasible''. If <math>v(\bar{y})</math><math>\leq \hat{y_0}</math><math>+\epsilon</math> terminate. Otherwise, determine the optimal multiplier vector<math> u^*</math>,increase <math>K^f</math> by one, set <math>u^{(k^f)}</math><math>=u^*</math>.Additionally, if <math>u(\bar{y})</math><math><U</math>,put <math>U=v(\bar{y})</math>. Finally, determine the function <math>L^*(y</math><math>,u^{(k^f)})</math> and return to Step 2. 
'''Step 3b.''' ''Primal is infeasible''. Determine a set of values of <math>{\lambda}^*</math><math>\in {\Lambda}</math> which satisfy

<math>min_{x \in X}</math>{<math>{\lambda}</math><math>^{*T}g(x,y)</math>}<math>>0</math> 

Increase <math>K^i</math>by one, put <math>{\lambda}</math><math>^{(K^i)}</math><math>={\lambda}^*</math> and determine the function <math>L*(y,</math><math>{\lambda}</math><math>^{(K^i)}</math><math>)</math>. Return to Step 2. 

When referring to "solutions" of nonconvex optimization problems it is sometimes understood that these may be locally, rather than globally, optimal points. In the above computational procedure it is necessary that the master be solved globally. When problems are jointly convex in <math>x</math> and <math>y</math>, the master is convex and globality is then achieved. Global solutions of the primal are also needed if convexity of X and/or convexity of <math>f(x,y)</math> and/or <math>g(x,y)</math> in <math>x</math>does not hold. 

===Determination of lambda*===
The determination of <math>{\lambda}^*</math> in Step 3b can be done by any Phase I algorithm. In particular Floudas et al (1989) proposed to solve the following problem: 

<math>min_{\alpha,x}</math><math>{\alpha}</math> 
<math>s.t.</math> <math>g(x,\bar{y})</math><math>-{\alpha}\underline{1}</math><math>\leq 0</math> 
<math>x\in X</math><math>,{\alpha}\in R</math> 

where <math>\underline{1}</math><math>=(1,1,</math><math>\dots</math><math>,1)^T</math>. 
Since the primal problem is infeasible, it is already known that the solution of this problem is positive. Once this problem is solved and a stationary point <math>x^*</math> is obtained, the following necessary Kuhn-Tucker conditions are satisfied: 
<math>1-\sum_i</math><math>\lambda_i^*</math><math>=0</math> 
<math>\lambda^{*T}\triangledown_x</math><math>g(x^*,\bar{y})=0</math> 
<math>\lambda_i^*</math><math>[g_i(x^*,\bar y)</math><math>-\alpha]=0</math><math>,\forall_i</math> 
<math>\lambda_i</math><math>\leq 0</math><math>,\forall_i</math> 

From them it is concluded that by solving the problem the condition <math>\lambda^*</math><math>\in\Lambda</math> is satisfied. Note that in this step it is not imperative to achieve globality.

===Explicit determination of L===
In most cases, the function <math>L^*</math> and <math>L_*</math> are implicitly defined. One case in which these functions can be obtained in explicit form is when the global minimum over <math>x</math> can be obtained independently of <math>y</math>. i.e. when Property (P) is satisfied. Two examples in which this property is satisfied arise when the functions <math>f(x,y)</math> and <math>g(x,y)</math> are separable in <math>x</math> and <math>y</math>, and in the variable factor programming case. 
Once Property (P) holds, evalution of <math>L^*(y,u^*)</math><math>,L_*(y,u^*)</math> simply requires that the minima in the master problem be global. This is the so called L-Dual-Adequacy property, which upon satisfaction of Property (P) can be achieved only if a global search of the solution of both 

<math>min_{x\in X}</math>{<math>f(x,y)</math><math>+u^{*T}g(x,y)</math>} <math>and</math> <math>min_{x\in X}</math>{<math>\lambda^{*T}</math><math>g(x,y)</math>} 

is conducted. 
Another case in which these functions can be obtained in explicit form is when Property (P') is satisfied, i.e. when the globally optimal solution of the primal, <math>x^*</math>, is also the solution of the minimization of the problems defined in the master problem. This is guaranteed when <math>f(x,y)</math> and <math>g(x,y)</math> are convex and separable in x. 

If Property (P') is simply assumed, then this implementation of Generalized Benders Decomposition may only be used, without guarantees, to identify candidates for local, but not global minima. 

The above discussion does not mean that Properties (P) and/or (P') must always hold. Other procedures may exist in which the Master problem can be solved without using these requirements.

===Advantages and Disadvantages of the algorithm===
The GBD method is similar in nature to the Outer-Approximation method. The difference arises in the definition of the MILP master problem (M-MIP). In the GBD method only active inequalities are considered and the set is <math>x\in X</math> is disregarded. In particular, assume an outer approximation given at a given point <math>(x^k,y^k)</math> , where for a fixed <math>y^k</math> the point <math>x^k</math> corresponds to the optimal solution to the problem. Making use of the Karush-Kuhn-Tucker conditions and eliminating the continuous variables x, the inequalities can be reduced to Lagrangrian cut projected in the y-space. This can be interpreted as a surrogate constraint because it is obtained as a linear combination of these. 

For the case where there is no feasible solution to the problem, if the point <math>x^k</math> is obtained from the feasibility subproblem (NLPF), the feasibility cut projected in y can be obtained using a similar procedure. In this way, the problem (M-MIP) reduces to a problem projected in the y-space. Since the master problem can be derived from the master problem of the Outer-Approximation algorithm in the context of problem. GBD can be regarded as a particular case of the Outer Approximation algorithm. In fact the following property, holds between the two methods: 

'''Property 1.''' ''Given the same set of K subproblems, the lower bound predicated by the relaxed master problem (RM-OA) is greater or equal to the one predicted by the relaxed master problem (RM-GBD).'' 

The above proof follows from the fact that the Lagrangian and feasibility cuts, <math>(LC^k)</math> and <math>(FC^k)</math>, are surrogates of the outer approximations <math>(OA^k)</math>. Given the fact that the lower bounds of GBD are generally weaker, this method commonly requires a large number of cycles or major iterations. As the number of 0-1 variables increases this differences become more pronounced. This is to be expected since only one new cut is generated per iteration. Therefore, user-supplied constraints must often be added to the master problem to strengthen the bounds. As for the OA algorithm, the trade-off is that while is generally predicts stronger lower bounds than GBD, the computational cost for solving the master problem (M-OA) is greater since the number of constraints added per iteration is equal to the number of nonlinear constraints plus the nonlinear objective. 

The following convergence property applies to the GBD method: 

'''Property 2.''' ''If problem (PI) has zero integrality gap, the GBD algorithm converges in one iteration once the optimal is found.'' 

The above property implies that the only case one can expect the GBD method to terminate in one iteration, is when the initial discrete vector is the optimum, and when the objective value of the NLP relaxation of problem (P1) is the same as the objective of the optimal mixed-integer solution. Given the relationship of GBD with the OA algorithm, Property 2 is also inherited by the OA method. 

One further property that relates the OA and GBD algorithms is the following: 

'''Property 3.''' ''The cut obtained from performing one Benders iteration on the MILP master (RM-OA) is equivalent to the cut obtained from the GBD algorithm.'' 

By making use of this property, instead of solving the MILP (RM-OA) to optimality, for instance by LP-based brand and bound, one can generate a GBD cut by simply performing one Benders iteration on the MILP. This property will prove to be useful when deriving a logic-based version of the GBD algorithm. With the above in mind, the advantages and disadvantages of GBD can thus be summarized as below: 

'''Advantages:''' Geoffrion (1972) generalized Benders approach to a broader class of programs in which the parameterized subproblem need no longer be a linear program. Nonlinear convex duality theory was employed to derive the quivalent masster problem. In GBD, the algorithm alternates between the solution of relaxed master problems and convex nonlinear subproblems. It is a helpful tool for solving process design problems. 

'''Disadvantages:''' Some restrictions regarding the convexity and other properties of the function involved were identified. Also, it is found that in certain cases only local minima may be found, whereas in other cases not even convergence to local optima can be achieved. These suggests that the procedure should be used with caution, especially when global optimality is being sought. More specifically, it is demonstrated that the implementation of the procedure guarantees globality when certain properties are satisfied (primal convexity, global solution of the master, Property (P) and L-Dual-Ade-quacy).

=Application=
Generalized Benders decomposition has been applied to a variety of problems that were modeled as mixed integer linear programming (MDLP) or mixed integer nonlinear programming (MINLP) problems. Geoffrion and Graves (1974) were among the first to use the algorithm to solve an MILP model for the design industrial distribution systems. Noonan and Giglio (1977) used Benders decomposition coupled with a successive linearization procedure to solve a nonlinear multiperiod mix integer model for planning the expansion of electric power generation networks. Rouhani et aL (1985) used GBD to solve an MINLP model for reactive source planning in power systems. Floudas and Ciric (1989) applied GBD to their MINLP formulation for heat exchanger network synthesis. EL-Halwagi and Manouthiosakis (1989) developed an MINLP model for the design and analysis of mass exchange networks and suggested the use of GBD for its solution. Sahinidis and Grossman (1990) applied the algorithm to the problem of scheduling multiproduct parallel production lines by solving the corresponding large scale MINLP model. The technique has also been used for solving nonconvex nonlinear programming (NLP) and MINLP problems: Geoffrion (1971) applied it to the variable factor programming problem and Floudas et aL (1989) suggested a Benders-based procedure for searching for the global optimum of nonconvex problems. 

Unfortunately, Benders decomposition has not been uniformly successful in all its application Florial et aL (1976) have noticed that the algorithm often converged very slowly when applied to their MILP model for scheduling the movement of railway engines. Bazaraa and Sherali (1980) observed that a large number of iterations were needed to solve their MILP model for quadratic assignment problems of realistic size. Sahinidis et aL (1989) have found brand and bound to be significantly faster than Benders decomposition for the solution of their multiperiod MILP for long range planning in the chemicla industries. Finally, with respect to the application of GBD to nonconvex problems, in contrast to Geoffrion (1972) and Floudas et aL (1989), Sahinidis and Grossmann (1989) have encountered cases where the application of this technique to their MINLP model for the planning of chemical processes fail to produce the global optima of these problems.

=Sources=
#Geoffrion, A. M., Elements of Large-Scale Mathematical Programming, Management Science, Vol. 16, No. 11, 1970.
#BENDERS, J. F., Partitioning Procedures for Solving ~lixed-Variables Programming Problems, Numerische Mathematik, Vol. 4, 1962.
#WILSON, R. Programming Variable Factors, Management Science, VoI. 13, No. 1, 1966.
#Balas, E., Duality in Discrete Programming: IV. Applications, CarnegieMellon University, Graduate School of Industrial Administration, Report No. 145, 1968.
#Meyer, R., The Validity of a Family of Optimization Methods, SIAM Journal on Control, Vol. 8, No. I, 1970.
#Dantzic, G. B., Linear Programming and Extensions, Princeton University Press, Princeton, New Jersey, 1963.
#Geoffriosi, A. M., Primal Resource-Directive Approaches for Optimizing Nonlinear Decomposable Systems, Operations Research, Vol. 18, No. 3, 1970.
#Hogan, W., Application of a General Convergence Theory for Outer Approximation Algorithms, University of California at Los Angeles, Western Management Science Institute, Working Paper No. 174, 1971.
#Eaves, B. C., and Zangwill, W. I., Generalized Cutting Plane Algorithms, SIAM Journal on Control, Vol. 9, No. 4, 1971.
#Geoffrion, A. M., A New Global Optimization Technique for Gaseous Diffusion Plant Operation and Capital Investment, University of California at Los Angeles, Graduate School of Business Administration, Discussion Paper, 1970.
#Geoffrion A. M., Duality in Nonlinear Programming: £1 Simplified Application-Oriented Development, SIAM Review, Vol. 13, No. 1, 1971.

Nonconvex generalized disjunctive programming (GDP)

2022-04-01T15:37:24Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Nonconvex_Generalized_disjunctive_programming_(GDP)

Author: Kaiwen Li (ChE345 Spring 2015) 
Steward: Prof.Fengqi You, Dajun Yue
 
=Introduction=
General disjunctive programming, GDP, is an alternative approach to represent the formulation of traditional Mixed-Integer Nonlinear Programming, solving discrete/continuous optimization problems. By using algebraic constraints, disjunctions and logic propositions, Boolean and continuous variables are involved in the GDP formulation. The formulation process of GDP problem are more intuitive, and the underlying logic structure of the problem can be kept, so that solution can be found more efficiently. GDP has been successfully applied to Process Design and Planning and Scheduling areas. 
 
However, funtions in GDP problem sometimes could be nonconvex. Due to nonconvexites, conventional MINLP algorithms are often trapped in suboptimal solutions. Thus, solutions to nonconvex GDP has been receiving increasing attention. 

=Algorithm=
In the GDP formulation, some functions might be nonconvex, which will lead to a nonconvex GDP problem. Traditional algorithms for MINLP such as Generalized Benders Decomposition (GBD), or Outer Approximation (OA) will fail to provide a global optimum because the solution of the NLP subproblem can be only a local optimum, while the cuts in master problem may not be valid either. Therefore, in order to get a global optimum, we need to introduce special algorithm for nonconvex GBD problems. 
Most of the methods rely on spatial branch and bound method.

==Motivation==
In order to find the best design and operation of a system in many problems in engineering, a set of algebraic expressions are used to reduce the problem with continuous and discrete variables. Normally theses problems lead to Mixed-Integer Nonlinear Programming (MINLP). However, in order to represent the behavior of physical, chemical, biological, financial or social systems accurately, nonlinear expressions are often used and leads to a MINLP with a nonconvex solution space. Instead of reaching a global optimal, this may give rise to local solutions that are suboptimal. Meanwhile, most algorithms for nonconvex problems are just particular implementation of the spatial branch and bound framework. Therefore, in order to find the global optimum of a large-scaled nonconvex MINLP models in reasonable computational time, Grossmann and others proposed nonconvex generalized disjunctive programming method.

==General Formulation for GDP==
Consider the following Generalized Disjunctive Programming problem, which includes Boolean variables, disjunctions and logic propositions: 
<math>min Z</math> <math>=</math> <math>\sum_{k\in K} c_k</math> <math>+ f(x)</math> 
 
s.t. <math>r(x)\le 0</math> 
 
<math>\bigvee_{j\in J_k}</math> <math>\begin{bmatrix} Y_{jk} \\ g_{jk}(x) \le 0 \\ c_k = \gamma_{jk} \end{bmatrix}</math>, <math> k\in K</math> 
 
<math>\Omega(Y) = True </math> 
 
<math>0 \le x \le U</math> 
 
<math>x \in R^n</math>, <math> c_k \in R^1</math>, <math> Y_{jk} \in {true, false}</math> 
 

where, <math>f:R^n \rightarrow R^1</math> is a function of the continuous variables x in the objective function, <math>g:R^n \rightarrow R^1</math> bolongs to the set of global constraints, the disjuctions <math> k\in K</math> are composed of a number of terms <math> j\in J_k</math>, which is connected by the OR operator. <math>Y_{jk}</math> is a Boolean varibale, <math>g_{jk}(x) \le 0</math> is a set of inequalities, and <math>c_k</math> is a cost variable. When <math>Y_{jk}</math> is true, <math>g_{jk}(x) \le 0</math> and <math>c_k</math> are enforced.Also, <math>x</math>represents continuous variables, with lower and upper bounds.Each term in the disjunctions gives rise to a nonempty feasible region which is generally nonconvex. Also, <math>\Omega(Y) = True </math> are logic propositions for the Boolean variables. 

==Overall Procedure==
The following flowchart(Fig.1) shows the overall procedure of the proposed two-level branch and bound algorithm.
[[File:Flowchart01_KL.png|600px]] 
First, introduce convex underestimators in the non-convex GDP problem (P), and construct the underestimating problem (R). This convex GDP problem is then reformulated as the convex NLP problem (CRP) by using the convex hull relaxation of each disjunction, which generates valid lower bound. Initial upper bound is obtained in step 0, by sloving P-MIP problem. It is a MINLP reformulation of the nonconvex GDP by a standard MINLP method. Then, use upper bound for bound contraction to reduce the feasible region, which is solved as a Bound Contraction Problem (BCP) in step 1. Next in step 2, discrete branch and bound method is applied at the first level to slove problem (CRP). After fixing all Boolean variables, solve the corresponding nonconvex NLP problems for a upper bound by using spatial branch and bound at the second level. Then, problem (CRP) is solved with fixed discrete choice in step 3.

==Detailed Formulation and Models==
===Convex Relaxation of GDP===
The following reformulation shows the introduction of valid convex underestimating functions to change Problem (P) into a convex GDP problem. 
 
<math>min Z^R</math> <math>=</math> <math>\sum_{k\in K} c_k</math> <math>+ \bar{f}(x)</math> 
 
s.t. <math>\bar{r}(x)\le 0</math> 
 
<math>\bigvee_{j\in J_k}</math> <math>\begin{bmatrix} Y_{jk} \\ \bar{g}_{jk}(x) \le 0 \\ c_k = \gamma_{jk} \end{bmatrix}</math>, <math> k\in K</math> 
 
<math>\Omega(Y) = True </math> 
 
<math>0 \le x \le U</math> ,<math>c_k \ge 0</math>
 
<math>x \in R^n</math>, <math>c_k \in R^1</math>, <math> Y_{jk} \in \{ true, false\} </math> 
 
where <math>\bar{f}</math>,<math>\bar{r}</math>,<math>\bar{g}</math> are valid convex underestimators so that <math>\bar{f}(x) \le f(x)</math>, <math>\bar{r}(x) \le 0</math>,<math>\bar{g}(x) \le 0</math> are satisfied if <math>r(x) \le 0</math>,<math>g(x) \le 0</math> (see fig.2) 
[[File:Fig02_KL.png|450px]] 
Considering problem (R) is a convex GDP, by replacing each disjunction by its convex hull we can relax problem (R), which generates the folloing convex NLP model: 
 
<math>min Z^L</math> <math>=</math> <math>\sum_{k\in K}\sum_{j\in J_k} \gamma_{jk}\lambda_{jk} + \bar{f}(x)</math> 
 
s.t. <math>\bar{r}(x)\le 0</math> 
 
<math> x = \sum_{j \in J_k} \nu_{jk}, \sum_{j \in J_k} \lambda_{jk} = 1, k\in K</math> 
 
<math> 0 \le \nu_{jk} \le \lambda_{jk}U_{jk}, j \in J_k, k\in K</math> 
 
<math> \lambda_{jk}\bar{g}_{jk}(\nu_{jk}/\lambda_{jk}) \le 0, j \in J_k, k\in K</math> 
 
<math> A\lambda \le a </math>
 
<math>0 \le x, \nu_{jk} \le U, 0 \le \lambda_{jk} \le 1, j \in J_k, k\in K, (CRP)</math> 
 
where <math>\nu_{jk}</math> is the disaggregated continuous variable for the <math>j </math>th term in the <math>k </math>th disjunction and <math>\lambda_{jk}</math> is the corresponding multiplier for each term <math>j \in J_k</math> in a given disjunction <math>k\in K</math> 
Given the problem (R) yields a lower bound, the problem (CRP) is a relaxation of problem (R).

===Global Upper Bound Subproblem===
This is step is to obtain a valid upper bound for problem (P) based on MINLP reformulation of (P) using big-M formulation.
 
<math>min Z = \sum_{k\in K}\sum_{j\in J_k} \gamma_{jk}y_{jk} + f(x)</math> 
 
s.t. <math>r(x)\le 0</math> 
 
<math>g_{jk}(x) \le M_jk(1-y_jk), j \in J_k, k\in K</math> 
 
<math> x = \sum_{j \in J_k} y_{jk}= 1, k\in K</math> 
 
<math> Ay \le a </math>
 
<math>0 \le x \le U, y_{jk} \in \{0,1\}, j \in J_k, k\in K, (P-MIP)</math> 
 
where <math>M_jk</math> is the big-M parameter, which provides a valid upper bound to the violation of <math>g_jk \le 0</math> and <math>U</math> is an upper bound to <math>x</math>.

===Bound Contraction Procedure===
The bound contraction scheme is designed to tighten the lower and upper bound of a given continuous variable <math>x_i</math>, which will eliminate the nonoptimal subregions and accelerate the search. This will greatly reduce the difference between lower and upper bounds of the objective function and save computational work. 
The contraction scheme is shown by the following NLP problem: 
 
<math>min/max </math> <math>x_i </math> 
 
s.t. <math>\sum_{k\in K}\sum_{j\in J_k} \gamma_{jk}\lambda_{jk} + \bar{f}(x)r(x)\le GUB</math> 
 
<math>\bar{r}(x) \le 0 </math> 
 
<math> x = \sum_{j \in J_k}\nu_{jk}, \sum_{j \in J_k}\lambda_{jk}= 1, k\in K</math> 
 
<math> 0 \le \nu_{jk} \le \lambda_{jk}U_{jk},j \in J_K, k \in K</math> 
 
<math> \lambda_{jk}\bar{g}_{jk}(\nu_{jk}/\lambda_{jk}) \le 0, j \in J_k, k\in K</math> 
 
<math> Ay \le a </math>
 
<math>0 \le \nu^{jk} \le U, 0 \le \lambda_{jk} \le 1, j \in J_k, k\in K, (BCP)</math> 
 
As is shown in the following Fig.3, if the solution point <math>x_i</math> of problem (CRP) does not lie at its bound, then we solve the NLP problem (CRP) and update the upper and lower bounds based on the relative distance from the solution to each bound of to decide the direction of bound contraction. Bound contraction is applied to only the continuous variables. Not much reduction will be made if the contraction is not successful, and then move to next variable. 
[[File:Fig03_KL.png|300px]] 
===Branch and Bound on Boolean Variables===
This step is where branch and bound method is applied in the space of the terms of the disjunctions by solving the relaxed convex NLP problem (CRP) at each node. The rule is to select the variable <math>\lambda_{jk}</math> with the largest fractional value in the solution. By creating two child nodes with <math>\lambda_{jk}=1</math> and <math>\lambda_{jk}=0</math>, <math>Y_{jk}</math> is fixed in problem (R) respectively. The nodes number should be finite due to finite Boolean variables and the global lower bound increases monotonically. 
Upper bound needs to be updated when there is a gap between the solution of this problem and the original nonconvex GDP problem (P). When the feasible solution to problem (R) is obtained and the gap between nonconvex term and convex underestimator in problem (P) and problem (R) is nonzero, all Boolean variables should be fixed and Spatial Branch and Bound Method should be introduced. 

===Spatial Branch and Bound Method===
Since there is a specific wiki page introducing this method, there is no need to discuss the detailed algorithm here. In this case, by solving problem (CRP), the nonconvex term with the max gap is chosen. Then, choose the middle point of the variable bound as the branching point. Lastly, select the node with lowest objective value. The current node will be fathomed if the lowest objective value of the node is greater than GUB.
==Advantages and Disadvantages==
===Advantages===
Many application shows that tighter lower bounds can be obtained in bilinear and concave problems by applying basic Nonconvex GDP steps, which often leads to a significant reduction of the numbers of nodes in Spatial Branch and Bound method. This is a direct indication of achieving tightening bounds. 
Meanwhile, GDPs are often reformulated as an MINLP, which enables us to take advantages of the existing MINLP solvers.

===Disadvantages===
The main problem about this method is that the question how to efficiently implement the strong relaxations within a spatial branch and bound framework in a large scale system still cannot be answered.

=Applications and Examples=
==Application: Solution Algorithm for nonconvex GDP problems==
===Step 0 Heuristic Search (Nonconvex MINLP)===
Use MINLP solvers (such as DICOPT++) to solve the nonconvex problem (P-MIP), and set GUB as the best upper bound and let <math>(Y^*,x*)</math> be the solution. 
===Step 1 Bound Contraction (Convex NLP)===
# Initialize the Relative Distance from <math>x^*_i</math> to each bound 
<math>RDL_i = \frac{x^*_i - x^L_i}{x^U_i - x^L_i}, RDU_i = \frac{x^U_i - x^*_i}{x^U_i - x^L_i}</math> 
# Increase iteration, solve problem (BCP) 
Update bound and continue with <math>x^*_i</math> when contraction is greater than <math>SP_M</math>. Otherwise select next x. 
# Return to 2 and repeat until max iteration. 
===Step 2 Branch and Bound on Discrete Variables (Convex NLP)===
# set tolerance <math>\alpha</math> for difference in <math>Z^L</math> and GUB. Set <math>\varepsilon</math> for max gap.
# Solve problem (CRP) to obtain lower bound <math>Z^L</math>. Update lowest lower bound as GLB.
# Fathom, go to step 3 or Branch on the node.
===Step 3 Spatial Branch and Bound (Convex NLP)===
# Fix all <math>\lambda_{jk}</math> according to solution from step 2
# Solve problem (CRP) until no open node with <math>Z^L \le GUB - \alpha</math>
# Go to step 2

==Example==
===Optimal structure for process system===
The system net work is shown in the following Fig4. 

[[File:Fig04_KL.png|thumb|left|450px|Fig4. Superstructure of the process]]
 

The formulation for this problem is shown as follows: 
<math> min Z =-35P1A - 30P2B +10F1 + 8F2 + F4A + F4B + 4F5A + 4F5B + 2CF + 50CD</math> 
 
<math>s.t. F3A = 0.55F1 + 0.5F2</math> 
 
<math>F3B = 0.45F1 + 0.5F2</math> 
 
<math>P1A = F8A + F10A + F6A </math> 
 
<math>P1B = F8B + F10B + F6B </math> 
 
<math>P2A = F9A + F11A + F7A</math> 
 
<math>P2B = F9B +F11B+F7B</math> 
 
<math>F6A=E6F3A,F6B=F6F3B</math> 
 
<math>F7A=E7F3A,F7B=E7F3B</math> 
 
<math>E4+D5+D6+D7 = 1</math> 
 
<math>P1A \ge 4.0P1B, P2B \ge 3.0P2A</math> 
 
<math>P1A+P1B \le 15, P2A +P2B \le 18</math> 
 

<math>\begin{bmatrix}
YF\\
F4A=E4F3A,F4B=E4F3B\\
2.5 \le F4A+F4B \le 25\\
F8A=0.85F4A, F8B=0.20F4B\\
F9A=0.15F4A,F9B=0.8F4B\\
CF=2
\end{bmatrix}</math>
<math>\or</math>
<math>\begin{bmatrix}
\lnot YF\\
F4A=F4B=0\\
F8A=F8B=0\\
F9A=F9B=0\\
E4=0\\
CF=0
\end{bmatrix}</math> 
 

<math>\begin{bmatrix}
YD\\
F5A=E5F3A,F5B=E5F3B\\
2.5 \le F5A+F5B \le 25\\
F10A=0.975F5A, F10B=0.050F5B\\
F11A=0.025F5A,F11B=0.950F5B\\
CD=25
\end{bmatrix}</math>
<math>\or</math>
<math>\begin{bmatrix}
\lnot YD\\
F5A=F5B=0\\
F10A=F10B=0\\
F11A=F11B=0\\
E5=0\\
CD=0
\end{bmatrix}</math>
 
<math>0 \le CF,CD; F1,F2 \le 25; 0 \le E4,E5,E6,E7 \le 1;</math> 
 
<math>Y\in \{true,false\}</math> 
 
The optimal solution is -510.08, and <math> F1^* = 8, F2^* = 25, P1^* = 15, P2^* = 18, E^* = (0.108,0.758,0,0.134)</math>, and <math>Y^*=(true,true)</math>, shown as the following Fig5. 
[[File:Fig05_KL.png|thumb|left|450px|Fig5. Optimal Solution]]
 

Here is the detailed explanation of the method applied in this example. Bilinear terms are replaced by continuous variables and linear underestimators and overestimators are introduced to construct the relaxed convex GDP problem (R). In step 0, which is mentioned above, DICOPT++ finds the trivial solution of <math>0.0</math> ath first NLP. By solving 14 LP subproblems in step 1, the bound is reduced <math>54.17%</math> for continuous variables. In step 2, discrete branch and bound on problem (CRP) reaches a feasible solution to the relaxed GDP at the third node, where <math>Y^L=(1,1)</math> and objective function of <math>-661.56</math>. When switches to Spatial Branch and Bound in step 3, <math>Y^F=(1,1)</math> is fixed. Then after 27 nodes and gap being reduced, upper bound is found to be <math>-510.08</math>. Using the upper bound to update GUB value in step 2, a cut to exclude <math>Y = (true,true)</math> is added to the previous node and the problem (CRP) is resolved, which leads to a infeasible solution. Then by closing current node and backtrack, two more nodes are searched and fathomed by the GUB.

=Conclusion=
For nonconvex generalized disjunctive programming, specified algorithm can provide a global optimum more efficiently by tightening the lower bound and reduce nodes in Spatial Branch and Bound step, compared with conventional MINLP and GBD algorithms. This method address the increasing need in applications in engineering and other area, where nonlinearity and nonconvexity are essential due to the problem nature.

=References=
# Lee, Sangbum, and Ignacio E. Grossmann. "New algorithms for nonlinear generalized disjunctive programming." Computers & Chemical Engineering 24.9 (2000): 2125-2141.
# Lee, Sangbum, and Ignacio E. Grossmann. "A global optimization algorithm for nonconvex generalized disjunctive programming and applications to process systems." Computers & Chemical Engineering 25.11 (2001): 1675-1697.
# Lee, Sangbum, and Ignacio E. Grossmann. "Global optimization of nonlinear generalized disjunctive programming with bilinear equality constraints: applications to process networks." Computers & chemical engineering 27.11 (2003): 1557-1575.
# Grossmann, Ignacio E., and Juan P. Ruiz. "Generalized Disjunctive Programming: A framework for formulation and alternative algorithms for MINLP optimization." Mixed Integer Nonlinear Programming. Springer New York, 2012. 93-115.

Extended cutting plane (ECP)

2022-04-01T15:36:22Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Extended_cutting_plane_(ECP)

Authors: Kyung Je Lee (ChE 345 Spring 2015)
Stewards: Dajun Yue, Fengqi You

==Introduction==
===Background===
'''Extended Cutting Plane''' is an optimization method suggested by Westerlund and Petersson in 1996 to solve Mixed-Integer NonLinear Programming (MINLP) problems[1]. ECP can be thought as an extension of Kelley's cutting plane method, which uses iterative Newton's method to refine feasible area and ultimately solve a problem within tolerable error. Therefore, ECP method does not require solving Non-Linear Programming(NLP) while Branch and Bound(BB) and Outer Approximation(OA) do require NLP solution for an upper bound. However, ECP generally requires more iterations. Although cutting plane method is criticized for its slow convergence, ECP is more efficient in cases where evaluation of non linear functions are time costly. For example, in optimization of a dynamic separation process done by Stefan Emet and Tapio Westerlund, ECP was 100 times faster than BB and 10 times faster than OA in solving the MINLP[2].

===Comparison with outer approximation===
ECP is closely related to OA in that it uses first differential approximation to cut the plane. The difference is that OA solves a NLP problem for its upper bound while ECP only uses solution of MILP in each iteration. This means that OA is more efficient in main iteration loops. This advantage of OA is more prominent in strongly non-linear MINLP problems that are mainly consisted of continuous variables. However, in pure integer non-linear programming problems, OA method does not have any advantage over ECP. Although it takes less iterations for OA to converge, solving NLP problem in each iteration incur more computation work than solving MILP because MILP has only one additional linear constraints in each iteration.

===Formulation of MINLP problem===

A general MINLP can be formulated as the following:

<math> \min_{x,y} Z=C_x^T x + C_y^T y </math>

<math>s.t. \quad g(x,y) \leq 0</math>

where, cx and cy are vectors with constants, x is a vector of continuous variables, y is a vector with discrete variables
and g(x, y) is a vector with continuous convex functions, all defined on a set.

==ECP algorithm==
Since g(x,y) is a convex function set and continuously differentiable it follows that

<math> g_i(x^k,y^k) + \left( \frac{\partial g_i}{\partial x}\right)_{x^k,y^k} (x-x^k) + \left( \frac{\partial g_i}{\partial y}\right)_{x^k,y^k} (y-y^k) \leq g_i(x,y)</math>

Also, if,

<math> max(g_i(x^k,y^k)) \leq 0 </math>

then for all i

<math> g_i(x^k,y^k) \leq 0</math>

Using these properties, ECP algorithm is as follows:

'''Step 0.''' Set k=1. Leave out all nonlinear constraints and set up an MILP problem.

'''Step 1.''' Solve the MILP problem.

'''Step 2.''' Check if the solution satisfies nonlinear constraints. If all constraints are satisfied, in other words, if <math> g_i(x^k,y^k) \leq 0</math> is true for all i, <math>(x^k, y^k)</math> is a global optimum.

'''Step 3''' If any of the constraints are not satisfied, add a constraint for most violated constraint, g
::<math> g_i(x^k,y^k) + \left( \frac{\partial g_i}{\partial x}\right)_{x^k,y^k} (x-x^k) + \left( \frac{\partial g_i}{\partial y}\right)_{x^k,y^k} (y-y^k)\leq 0 </math>
::set k= k+1 and go to Step 1.

== Example ==
<math> \min z = -x_1-x_2 </math>

<math> s.t. \quad g1(x_1,x_2) = 0.15(x_1-8)^2 + 0.1 (x_2-6)^2+0.025e^{x_1}x_2^{-2}-5 \leq 0 </math>
::<math> g2(x_1,x_2) = 1/x_1 + 1/x_2- x_1^{0.5}x_2^{0.5}+4 \leq 0 </math>
::<math> 2x_1-3x_2-2 \leq 0 </math>
::<math> 1\leq x_1 \leq 20,\quad 1\leq x_2\leq 20, \quad x_1\in \R \quad x_2 \in \N</math>

First MILP can be set-up by leaving out non-linear constraint <math> g_1</math> and <math> g_2</math>
[[File:ECP.PNG|350px|thumb|right|Visualization of how ECP is used to solve MINLP4]]
<math> \min z = -x_1-x_2 </math>

<math> s.t. \quad2x_1-3x_2-2 \leq 0 </math>
::<math> 1\leq x_1 \leq 20,\quad 1\leq x_2\leq 20 </math>
::<math> x_1\in \R \quad x_2 \in \N </math>

The first iteration gives <math>x_1^1 = 20, x_2^1=20 </math>.
This leads to <math>g_1(x_1^1,x_2^1)=30359, \quad g_2(x_1^1,x_2^1)= -15.9 </math>

A new constraint is introduced for the second iteration with the most violated nonlinear constraint <math>g_1</math>

<math> \min z = -x_1-x_2 </math>

<math> s.t. \quad g_1(x_1^1,x_2^1)+\triangledown g_1(x_1^1,x_2^1)^T(x-x_1^1,x-x_2^1) \leq 0 </math>
::<math> 2x_1-3x_2-2 \leq 0 </math>
::<math> 1\leq x_1 \leq 20,\quad 1\leq x_2\leq 20 </math>
::<math> x_1\in \R \quad x_2 \in \N </math>

Same procedure can be taken for consequent iterations, and with 17 iterations, ECP gives answer of <math>x_1^1 = 8.9, x_2^1=12 </math>.

==Non-smooth Functions==
In non-smooth functions, ECP algorithm can be generalized by relaxing continuous differentiability. Therefore, the only change in the algorithm is that subgradients are used instead of gradients. Subgradient of convex function <math>h</math> at point<math>z_0</math> is any vector <math>\xi</math> that satifies
<math>h(z_0) + \xi(z-z_0) \leq h(z)</math> 3 

==Conclusion==
ECP is an extension of cutting plane(CP) method that is used to solve NLP problems. The application of cutting plane to MINLP is rather straight forward and the strength of ECP lies in that it is simple and robust. Therefore, it is suitable for solving large MINLP problems with moderate degree of non-linearity and complex system that require extensive computational work. ECP only requires one additional constraint to improve one solution for MILP problem at each iteration whereas both NLP and MILP problems are solved in other MINLP methods. However, since there is only one adjustment made at a time, it has slow convergence to solution.

==Reference==
1. Tapio Westerlund, Frank Pettersson, An extended cutting plane method for solving convex MINLP problems, Computers & Chemical Engineering, Volume 19, Supplement 1, 11–14 June 1995, Pages 131-136, ISSN 0098-1354, http://dx.doi.org/10.1016/0098-1354(95)87027-X.

2. Stefan Emet and Tapio Westerlund. 2008. Solving a dynamic separation problem using MINLP techniques. Appl. Numer. Math. 58, 4 (April 2008), 395-406. DOI=10.1016/j.apnum.2007.01.023 http://dx.doi.org/10.1016/j.apnum.2007.01.023

3.Tapio Westerlund, Hans Skrifvars, Iiro Harjunkoski, Ray Pörn, An extended cutting plane method for a class of non-convex MINLP problems, Computers & Chemical Engineering, Volume 22, Issue 3, 28 February 1998, Pages 357-365, ISSN 0098-1354, http://dx.doi.org/10.1016/S0098-1354(97)00000-8.
(http://www.sciencedirect.com/science/article/pii/S0098135497000008)

4. An extended supporting hyperplane algorithm for convex MINLP problems http://blogs.abo.fi/ose/files/2014/10/ose2014_kronqvist.pdf

Classical robust optimization

2022-04-01T15:35:47Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Classical_robust_optimization

Author Names: Andre Ramirez-Cedeno 
Steward: Fengqi You, Dajun Yue 

==Introduction and History==
Robust optimization is a subset of optimization theory that deals with a certain measure of robustness vs uncertainty. This balance of robustness and uncertainty is represented as variability in the parameters of the problem at hand and or its solution [1]. In robust optimization, the modeler aims to find decisions that are optimal for the worst-case realization of the uncertainties within a given set [2]. Robust optimization dates back to the beginning of modern decision theory in the 1950’s. It became a discipline of its own in the 1970’s with paralleled development in other technological fields.

===Uncertainty===
Sources of uncertainty could be due to at least three different conditions [3]. 
'''1. Ignorance''' - such as not knowing exactly how much oil is in a reserve 
'''2. Noise''' - such as measurement errors, or incomplete data 
'''3. Events that have not yet occurred''' - such as future product demand 

===What is Robustness?===

Robustness refers to the ability of a system to cope with errors during an execution. It can also be defined as the ability of an algorithm to continue operating despite abnormalities in calculations. Most algorithms try to find a balance between robustness and efficiency/execution time [4].

==Applications==

Robust Optimization has traditionally been applied in statistics, but is now applied in; 

1. Operations research 
2. Control theory 
3. Finance and Portfolio management 
4. Logistics 
5. Manufacturing engineering 
6. Chemical engineering (Oil field developing) 
7. Medicine 
8. Computer science 

In engineering problems, these formulations often take the name of "Robust Design Optimization" or "Reliability Based Design Optimization"

==Simple Mathematical Example==

<math>
\begin{align}
\text{Max} & ~~ 5X + 2Y\\
\text{s.t} & ~~ cX + dY \le 15 \\
& ~~ X \le 0 \\
& ~~ Y \le 0\\
& ~~ \forall c,d \\
\end{align}
</math>

This problem is a simple example of a robust optimization problem. The last clause; “for all (c,d) ϵ P” makes it a robust optimization problem because it implies that for a pair (X,Y) to be acceptable, the constraint cX + dY <= 15 must be satisfied for all values of (c,d) including the worst (c,d) pair that maximized the value of cX + dY for the given values of (x,y). For this example, P is simplified to a finite set meaning that for each (c,d) within the set, there is a constraint cX + dY <= 15.

==Engineering Design Example==
===Rolls-Royce [5]===

Robust design allows variation in the design process and the consideration of the appropriate selection of the nominal design point.

'''Step 1:''' Define – Understand what is important to the client and formulate problem in engineering language. Choose design concepts with variation in mind. 

'''Step 2:''' Characterize – Generate measurable “critical to quality” (CTQ’s) criteria. For each CTQ, understand the possible sources of variation and measure the effects of variation. 

'''Step 3:''' Optimize – For each CTQ, choose a strategy to reduce the effects of variation. 

'''Step 4:''' Verify – Use knowledge of variation from previous steps to determine its effects in construction and design plan.

The Variance Transmission equation is given by: 

'''<math> \sigma_Y^2 </math> = <math> \sigma_{X1}^2 \left ( \frac{dY}{dX1} \right )^2 + \sigma_{X2}^2 \left ( \frac{dY}{dX2} \right )^2 + \cdots + \sigma_{Xn}^2 \left ( \frac{dY}{dXn} \right )^2</math>'''

[[File:Rolls_Royce_robust_design.png]]

==References==

[1] Wikipedia page for Robust Optimization 
[2] http://www.robustopt.com/references/Robust%20Optimization%20Made%20Easy%20with%20ROME.pdf 
[3] J. Rosenhead. Robustness analysis: keeping your options open. In J. Rosenhead, editor, Rational analysis for a problematic world: problem structuring methods for complexity, uncertainty and conflict, pages 193–218, Chichester, UK, 1989. John Wiley & Sons. 
[4] http://www.stanford.edu/~bakerjw/Publications/Baker%20et%20al%20(2008)%20Robustness,%20Structural%20Safety.pdf 
[5] http://web.stanford.edu/group/uq/events/optimization/2011/3-rollsroyce.pdf

Dynamic optimization

2022-04-01T15:35:15Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Dynamic_optimization

Authors: Hanyu Shi (ChE 345 Spring 2014)

Steward: Dajun Yue, Fengqi You

Date Presented: Apr. 10, 2014

Authors: Issac Newton, Albert Einstein (ChE 345 Spring 2014)

Steward: Dajun Yue, Fengqi You

Date Presented: Apr. 10, 2014

==Introduction==
In this work, we will focus on the “at the same time” or direct transcription approach which allow a simultaneous method for the dynamic optimization problem. In particular, we formulate the dynamic optimization model with orthogonal collocation methods. These methods can also be regarded as a special class of implicit Runge–Kutta (IRK) methods. We apply the concepts and properties of IRK methods to the differential equations directly. With locating potential break points appropriately, this approach can model large-scale optimization formulations with the property of maintaining accurate state and control profiles. We mainly follows Biegler's work.

==General Dynamic Optimization Problem==

Differential algebraic equations in process engineering often have following characteristics: first,large-scale models – not easily scaled; second, sparse but no regular structure; third, direct linear solvers widely used; last, coarse-grained decomposition of linear algebra.

[[File:shy 345 wiki fig 02.png]]

Figure 2. Dynamic optimization approach

There are several approaches can be applied to solve the dynamic optimization problems, which are shown in Figure 2.

Differential equations can usually be used to express conservation Laws, such as mass, energy, momentum. Algebraic equations can usually be used to express constitutive equations, equilibrium, such as physical properties, hydraulics, rate laws. Algebraic equations usually have semi-explicit form and assume to be index one i.e., algebraic variables can be solved uniquely by algebraic equations.

Dynamic Optimization Problem has the following general form:

<math>

\begin{array}{l}
\min \;\Phi \left( {z\left( t \right),y\left( t \right),u\left( t \right),p,{t_f}} \right)\\
s.t.\;\;\frac{{dz\left( t \right)}}{{dt}} = f\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right)\\
g\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right) = 0\\
{z^0} = z\left( 0 \right)\\
{z^l} \le z\left( t \right) \le {z^u}\\
{y^l} \le y\left( t \right) \le {y^u}\\
{u^l} \le u\left( t \right) \le {u^u}\\
{p^l} \le p \le {p^u}
\end{array}

</math>

<math>t</math>, time

<math>z</math>, differential variables y, algebraic variables

<math>t_f</math> , final time

<math>u</math>, control variables

<math>p</math>, time independent parameters

(This follows Biegler's slides ）

==Derivation of Collocation Methods==
We first consider the differential algebraic system shown as follows:

<math>

\begin{array}{l}
\frac{{dz}}{{dt}} = f\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right),\;z\left( 0 \right) = {z_0}\\
g\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right) = 0
\end{array}

</math> （1）

The simultaneous approach requires discretizing of the state variables <math> z\left( t \right) </math>, output variables <math> y\left( t \right) </math> and manipulate variables <math> u\left( t \right) </math>. We require the following properties to yield an efficient NLP formulation:

1) The explicit ODE discretization holds little computational advantage because Since the nonlinear program requires an iterative solution of the KKT conditions.

2) A single step approach which is self-starting and does not rely on smooth profiles that extend over previous time steps is preferred, because the NLP formulation needs to deal with discontinuities in control profiles.

3) The high-order implicit discretization provides accurate profiles with relatively few finite elements. As a result, the number of finite elements need not be excessively large, particularly for problems with many states and controls.

[[File:shy 345 wiki fig 01.png]]

Figure 1: Polynomial approximation for state profile across a finite element.

==Polynomial Representation for ODE Solutions==

We consider the following ODE:

<math>

\frac{{dz}}{{dt}} = f\left( {z\left( t \right),t} \right),\;z\left( 0 \right) = {z_0}

</math> (2)

to apply the collocation method, we need to solve the differential equation (2) at certain points. For the state variable, we consider a polynomial approximation of order <math>K+1</math> (i.e., degree ≤ <math>K</math> ) over a single finite element, as shown in the above figure. This polynomial, denoted by <math>{z^K}(t)</math>, can be represented in a number of equivalent ways, including the power series representation shown in equation (3), the Newton divided difference approximation, or B-splines.

<math>

\frac{{dz}}{{dt}} = f\left( {z\left( t \right),t} \right),\;z\left( 0 \right) = {z_0}

</math> (3)

We apply representations based on Lagrange interpolation polynomials to generate the NLP formulation, because the polynomial coefficients and the profiles have the same variable bounds. Here we select <math>K+1 </math> interpolation points in element i and represent the state in a given element <math>i</math> as

<math>

\begin{array}{l}
\left. {\begin{array}{*{20}{c}}
{t = {t_{i - 1}} + {h_i} \cdot \tau ,}\\
{{z^K}\left( t \right) = \sum\limits_{j = 0}^K {{l_j}\left( \tau \right) \cdot {z_{ij}},} }
\end{array}} \right\}t \in \left[ {{t_{i - 1}},{t_i}} \right],\tau \in \left[ {0,1} \right],\\
where\;{l_j}\left( \tau \right) = \prod\limits_{k = 0, \ne j}^K {\frac{{\left( {\tau - {\tau _k}} \right)}}{{\left( {{\tau _j} - {\tau _k}} \right)}}} ,
\end{array}

</math> （4）

<math> {\tau _0} = 0,\;{\tau _i} < {\tau _{i + 1}},\;j = 0,...,K - 1 </math>， and hi is the length of element <math>i</math>. This polynomial representation has the desirable property that <math>{z^K}({t_{ij}}) = {z_{ij}}</math>, where <math>{t_{ij}} = {t_{i - 1}} + {\tau _j}{h_j}</math>.

We use a Lagrange polynomial with K interpolation points to represent the time derivative of the state. This leads to the Runge–Kutta basis representation for the differential state:

<math>

{z^K}\left( t \right) = {z_{i - 1}} + {h_i} \cdot \sum\limits_{j = 1}^K {{\Omega _j}\left( \tau \right)} \cdot {\dot z_{ij}}

</math> (5)

where <math> {z_{i - 1}} </math> is a coefficient that represents the differential state at the beginning of element <math>i</math>, <math> {\dot z_{ij}} </math>represents the time derivative <math>\frac{{d{z^K}({t_{ij}})}}{{d\tau }} </math>, and <math> {\Omega _j}(\tau ) </math> is a polynomial of order K satisfying

<math>

{\Omega _j}\left( \tau \right) = \int_0^\tau {{l_j}(\tau ')} d\tau ',t \in \left[ {{t_{i - 1}},{t_i}} \right],\tau \in \left[ {0,1} \right]

</math> (6)

We substitute the polynomial into equation (1) to calculate the polynomial coefficients, which is an approximation of the DAE. This results in the following collocation equations.

<math>

{z^K}\left( t \right) = {z_{i - 1}} + {h_i} \cdot \sum\limits_{j = 1}^K {{\Omega _j}\left( \tau \right)} \cdot {\dot z_{ij}}

</math> (7)

with <math> {z^k}({t_i} - 1) </math> calculated separately. For the polynomial representations (4) and (5), we normalize time over the element, write the state profile as a function of τ , and apply <math> \frac{{d{z^K}}}{{d\tau }} = {h_i}\frac{{d{z^K}}}{{dt}} </math> easily. For the Lagrange polynomial (4), the collocation equations become

<math>

\sum\limits_{j = 0}^K {{z_{ij}} \cdot \frac{{d{l_j}\left( {{\tau _k}} \right)}}{{d\tau }}} = {h_i} \cdot f\left( {{z_{ik}},{t_{ik}}} \right),\;k = 1,...,K

</math> (8)

while the collocation equations for the Runge–Kutta basis are given by

<math>

{\dot z_{ik}} = f\left( {{z_{ik}},{t_{ik}}} \right)

</math> (9)

<math>

{z_{ik}} = {z_{i - 1}} + {h_i} \cdot \sum\limits_{j = 1}^K {{\Omega _j}\left( \tau \right)} \cdot {\dot z_{ij}},\;k = 1,...,K

</math> (10)

with <math> {z_i} - 1 </math>determined from the previous element <math> i-1 </math> or from the initial condition on the ODE.

==Example==

An example is given here to demonstrate the application of the collocation method.

A differential equation is given as follows:

<math>

\frac{{dz}}{{dt}} = {z^2} - 2 \cdot z + 1,\;z\left( 0 \right) = - 3

</math> (11)

With t \in \left[ {0,1} \right], The analytic solution of this differential equation is <math>z\left( t \right) = \frac{{4 \cdot t - 3}}{{4 \cdot t + 1}}</math>.

Lagrange interpolation and collocation method is applied to this differential equation respectively. And the number of collocation points in each finite element is 3. The number of finite elements is N , and the length of the finite element is 1/N . The following equations is given then:

<math>

\sum\limits_{j = 0}^3 {{z_{ij}}\frac{{d{l_j}\left( {{\tau _k}} \right)}}{{d\tau }}} = h\left( {z_{ik}^2 - 2 \cdot {z_{ik}} + 1} \right),\;k = 1,...,3,\;i = 1,...,N

</math> (12)

<math>

{z_{i + 1,0}} = \sum\limits_{j = 0}^0 {{l_j}\left( 1 \right)} \cdot {z_{ij}},\;i = 1,...,N - 1

</math> (13)

<math>

{z_f} = \sum\limits_{j = 0}^K {{l_j}\left( 1 \right)} \cdot {z_{Nj}},\;{z_{1,0}} = -3

</math> (14)

With Radau collocation method, <math> {\tau _0} = 0 </math>, <math> {\tau _1} = {\rm{0.155051}}</math>, <math> {\tau _2} = {\rm{0.644949}}</math> and <math> {\tau _3} = 1 </math> can be obtained. The collocation equations are given as follows:

<math>

\sum\limits_{j = 0}^3 {{z_j}\frac{{d{l_j}\left( {{\tau _k}} \right)}}{{d\tau }}} = \left( {z_k^2 - 2 \cdot {z_k} + 1} \right),\;k = 1,...,3

</math> (14)

which can be formulated as:

<math>

\begin{array}{l}
{z_0} \cdot \left( { - 30 \cdot \tau _k^2 + 36 \cdot {\tau _k} - 9} \right) + {z_1} \cdot \left( {{\rm{46.7423}} \cdot \tau _k^2 - {\rm{51.2592}} \cdot {\tau _k} + {\rm{10.0488}}} \right)\\
+ {z_3} \cdot \left( { - {\rm{26.7423}} \cdot \tau _k^2 + {\rm{20.5925}} \cdot {\tau _k} - {\rm{ 1.38214}}} \right) + {z_3} \cdot \left( {10 \cdot \tau _k^2 - \frac{{16}}{3} \cdot {\tau _k} + \frac{1}{3}} \right)\\
= \left( {z_k^2 - 2 \cdot {z_k} + 1} \right),\;k = 1,...,3
\end{array}

</math> (15)

[[File:shy 345 wiki fig 03.png]]

Figure 3. Comparison of Radau collocation solution with exact solution

The results are given as following by solving the above equations:

<math>

\left\{ {\begin{array}{*{20}{c}}
{{z_0} = - 3}\\
{{z_1} = - {\rm{1.65701}}}\\
{{z_2} = {\rm{0.032053}}}\\
{{z_3} = {\rm{0.207272}}}
\end{array}} \right.

</math> (16)

As shown in Figure 3.2 the error <math>\left\| {z\left( 1 \right) - {z^K}\left( 1 \right)} \right\|</math>， is less than <math>10^-6</math> for <math>N=5</math> and converges with <math>O(h^5)</math>, which is consistent with the expected order <math>2K-1</math>.

(This example follows the work of Biegler and can be found in P293 of “ Nonlinear Programmng”.)

==Conclusion==

In this work, we mainly discussed simultaneous collocation approach for dynamic optimization problems, which formulated the differential equations to a set of algebraic equations. These direct transcription formulations depended on fully discretizing of the differential algebraic equations (DAE), which enabled us solve the simultaneous optimization problem without relying on embedded DAE solvers. Because of this simultaneous formulation, we got the exact first and second order derivatives through the optimization modeling system, and both structure and sparsity can be exploited.

==References==

1. Biegler, Lorenz T. Nonlinear programming: concepts, algorithms, and applications to chemical processes. Vol. 10. SIAM, 2010.

2. Chu, Yunfei, and Fengqi You. "Integration of scheduling and control with online closed-loop implementation: Fast computational strategy and large-scale global optimization algorithm." Computers & Chemical Engineering 47 (2012): 248-268.

3. http://en.wikipedia.org/wiki/Dynamic_programming

4. http://en.wikipedia.org/wiki/Differential_algebraic_equation

5. http://numero.cheme.cmu.edu/uploads/dynopt.pdf

Dynamic optimization

2022-04-01T15:35:04Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Dynamic_optimization
Authors: Hanyu Shi (ChE 345 Spring 2014)

Steward: Dajun Yue, Fengqi You

Date Presented: Apr. 10, 2014

Authors: Issac Newton, Albert Einstein (ChE 345 Spring 2014)

Steward: Dajun Yue, Fengqi You

Date Presented: Apr. 10, 2014

==Introduction==
In this work, we will focus on the “at the same time” or direct transcription approach which allow a simultaneous method for the dynamic optimization problem. In particular, we formulate the dynamic optimization model with orthogonal collocation methods. These methods can also be regarded as a special class of implicit Runge–Kutta (IRK) methods. We apply the concepts and properties of IRK methods to the differential equations directly. With locating potential break points appropriately, this approach can model large-scale optimization formulations with the property of maintaining accurate state and control profiles. We mainly follows Biegler's work.

==General Dynamic Optimization Problem==

Differential algebraic equations in process engineering often have following characteristics: first,large-scale models – not easily scaled; second, sparse but no regular structure; third, direct linear solvers widely used; last, coarse-grained decomposition of linear algebra.

[[File:shy 345 wiki fig 02.png]]

Figure 2. Dynamic optimization approach

There are several approaches can be applied to solve the dynamic optimization problems, which are shown in Figure 2.

Differential equations can usually be used to express conservation Laws, such as mass, energy, momentum. Algebraic equations can usually be used to express constitutive equations, equilibrium, such as physical properties, hydraulics, rate laws. Algebraic equations usually have semi-explicit form and assume to be index one i.e., algebraic variables can be solved uniquely by algebraic equations.

Dynamic Optimization Problem has the following general form:

<math>

\begin{array}{l}
\min \;\Phi \left( {z\left( t \right),y\left( t \right),u\left( t \right),p,{t_f}} \right)\\
s.t.\;\;\frac{{dz\left( t \right)}}{{dt}} = f\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right)\\
g\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right) = 0\\
{z^0} = z\left( 0 \right)\\
{z^l} \le z\left( t \right) \le {z^u}\\
{y^l} \le y\left( t \right) \le {y^u}\\
{u^l} \le u\left( t \right) \le {u^u}\\
{p^l} \le p \le {p^u}
\end{array}

</math>

<math>t</math>, time

<math>z</math>, differential variables y, algebraic variables

<math>t_f</math> , final time

<math>u</math>, control variables

<math>p</math>, time independent parameters

(This follows Biegler's slides ）

==Derivation of Collocation Methods==
We first consider the differential algebraic system shown as follows:

<math>

\begin{array}{l}
\frac{{dz}}{{dt}} = f\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right),\;z\left( 0 \right) = {z_0}\\
g\left( {z\left( t \right),y\left( t \right),u\left( t \right),p} \right) = 0
\end{array}

</math> （1）

The simultaneous approach requires discretizing of the state variables <math> z\left( t \right) </math>, output variables <math> y\left( t \right) </math> and manipulate variables <math> u\left( t \right) </math>. We require the following properties to yield an efficient NLP formulation:

1) The explicit ODE discretization holds little computational advantage because Since the nonlinear program requires an iterative solution of the KKT conditions.

2) A single step approach which is self-starting and does not rely on smooth profiles that extend over previous time steps is preferred, because the NLP formulation needs to deal with discontinuities in control profiles.

3) The high-order implicit discretization provides accurate profiles with relatively few finite elements. As a result, the number of finite elements need not be excessively large, particularly for problems with many states and controls.

[[File:shy 345 wiki fig 01.png]]

Figure 1: Polynomial approximation for state profile across a finite element.

==Polynomial Representation for ODE Solutions==

We consider the following ODE:

<math>

\frac{{dz}}{{dt}} = f\left( {z\left( t \right),t} \right),\;z\left( 0 \right) = {z_0}

</math> (2)

to apply the collocation method, we need to solve the differential equation (2) at certain points. For the state variable, we consider a polynomial approximation of order <math>K+1</math> (i.e., degree ≤ <math>K</math> ) over a single finite element, as shown in the above figure. This polynomial, denoted by <math>{z^K}(t)</math>, can be represented in a number of equivalent ways, including the power series representation shown in equation (3), the Newton divided difference approximation, or B-splines.

<math>

\frac{{dz}}{{dt}} = f\left( {z\left( t \right),t} \right),\;z\left( 0 \right) = {z_0}

</math> (3)

We apply representations based on Lagrange interpolation polynomials to generate the NLP formulation, because the polynomial coefficients and the profiles have the same variable bounds. Here we select <math>K+1 </math> interpolation points in element i and represent the state in a given element <math>i</math> as

<math>

\begin{array}{l}
\left. {\begin{array}{*{20}{c}}
{t = {t_{i - 1}} + {h_i} \cdot \tau ,}\\
{{z^K}\left( t \right) = \sum\limits_{j = 0}^K {{l_j}\left( \tau \right) \cdot {z_{ij}},} }
\end{array}} \right\}t \in \left[ {{t_{i - 1}},{t_i}} \right],\tau \in \left[ {0,1} \right],\\
where\;{l_j}\left( \tau \right) = \prod\limits_{k = 0, \ne j}^K {\frac{{\left( {\tau - {\tau _k}} \right)}}{{\left( {{\tau _j} - {\tau _k}} \right)}}} ,
\end{array}

</math> （4）

<math> {\tau _0} = 0,\;{\tau _i} < {\tau _{i + 1}},\;j = 0,...,K - 1 </math>， and hi is the length of element <math>i</math>. This polynomial representation has the desirable property that <math>{z^K}({t_{ij}}) = {z_{ij}}</math>, where <math>{t_{ij}} = {t_{i - 1}} + {\tau _j}{h_j}</math>.

We use a Lagrange polynomial with K interpolation points to represent the time derivative of the state. This leads to the Runge–Kutta basis representation for the differential state:

<math>

{z^K}\left( t \right) = {z_{i - 1}} + {h_i} \cdot \sum\limits_{j = 1}^K {{\Omega _j}\left( \tau \right)} \cdot {\dot z_{ij}}

</math> (5)

where <math> {z_{i - 1}} </math> is a coefficient that represents the differential state at the beginning of element <math>i</math>, <math> {\dot z_{ij}} </math>represents the time derivative <math>\frac{{d{z^K}({t_{ij}})}}{{d\tau }} </math>, and <math> {\Omega _j}(\tau ) </math> is a polynomial of order K satisfying

<math>

{\Omega _j}\left( \tau \right) = \int_0^\tau {{l_j}(\tau ')} d\tau ',t \in \left[ {{t_{i - 1}},{t_i}} \right],\tau \in \left[ {0,1} \right]

</math> (6)

We substitute the polynomial into equation (1) to calculate the polynomial coefficients, which is an approximation of the DAE. This results in the following collocation equations.

<math>

{z^K}\left( t \right) = {z_{i - 1}} + {h_i} \cdot \sum\limits_{j = 1}^K {{\Omega _j}\left( \tau \right)} \cdot {\dot z_{ij}}

</math> (7)

with <math> {z^k}({t_i} - 1) </math> calculated separately. For the polynomial representations (4) and (5), we normalize time over the element, write the state profile as a function of τ , and apply <math> \frac{{d{z^K}}}{{d\tau }} = {h_i}\frac{{d{z^K}}}{{dt}} </math> easily. For the Lagrange polynomial (4), the collocation equations become

<math>

\sum\limits_{j = 0}^K {{z_{ij}} \cdot \frac{{d{l_j}\left( {{\tau _k}} \right)}}{{d\tau }}} = {h_i} \cdot f\left( {{z_{ik}},{t_{ik}}} \right),\;k = 1,...,K

</math> (8)

while the collocation equations for the Runge–Kutta basis are given by

<math>

{\dot z_{ik}} = f\left( {{z_{ik}},{t_{ik}}} \right)

</math> (9)

<math>

{z_{ik}} = {z_{i - 1}} + {h_i} \cdot \sum\limits_{j = 1}^K {{\Omega _j}\left( \tau \right)} \cdot {\dot z_{ij}},\;k = 1,...,K

</math> (10)

with <math> {z_i} - 1 </math>determined from the previous element <math> i-1 </math> or from the initial condition on the ODE.

==Example==

An example is given here to demonstrate the application of the collocation method.

A differential equation is given as follows:

<math>

\frac{{dz}}{{dt}} = {z^2} - 2 \cdot z + 1,\;z\left( 0 \right) = - 3

</math> (11)

With t \in \left[ {0,1} \right], The analytic solution of this differential equation is <math>z\left( t \right) = \frac{{4 \cdot t - 3}}{{4 \cdot t + 1}}</math>.

Lagrange interpolation and collocation method is applied to this differential equation respectively. And the number of collocation points in each finite element is 3. The number of finite elements is N , and the length of the finite element is 1/N . The following equations is given then:

<math>

\sum\limits_{j = 0}^3 {{z_{ij}}\frac{{d{l_j}\left( {{\tau _k}} \right)}}{{d\tau }}} = h\left( {z_{ik}^2 - 2 \cdot {z_{ik}} + 1} \right),\;k = 1,...,3,\;i = 1,...,N

</math> (12)

<math>

{z_{i + 1,0}} = \sum\limits_{j = 0}^0 {{l_j}\left( 1 \right)} \cdot {z_{ij}},\;i = 1,...,N - 1

</math> (13)

<math>

{z_f} = \sum\limits_{j = 0}^K {{l_j}\left( 1 \right)} \cdot {z_{Nj}},\;{z_{1,0}} = -3

</math> (14)

With Radau collocation method, <math> {\tau _0} = 0 </math>, <math> {\tau _1} = {\rm{0.155051}}</math>, <math> {\tau _2} = {\rm{0.644949}}</math> and <math> {\tau _3} = 1 </math> can be obtained. The collocation equations are given as follows:

<math>

\sum\limits_{j = 0}^3 {{z_j}\frac{{d{l_j}\left( {{\tau _k}} \right)}}{{d\tau }}} = \left( {z_k^2 - 2 \cdot {z_k} + 1} \right),\;k = 1,...,3

</math> (14)

which can be formulated as:

<math>

\begin{array}{l}
{z_0} \cdot \left( { - 30 \cdot \tau _k^2 + 36 \cdot {\tau _k} - 9} \right) + {z_1} \cdot \left( {{\rm{46.7423}} \cdot \tau _k^2 - {\rm{51.2592}} \cdot {\tau _k} + {\rm{10.0488}}} \right)\\
+ {z_3} \cdot \left( { - {\rm{26.7423}} \cdot \tau _k^2 + {\rm{20.5925}} \cdot {\tau _k} - {\rm{ 1.38214}}} \right) + {z_3} \cdot \left( {10 \cdot \tau _k^2 - \frac{{16}}{3} \cdot {\tau _k} + \frac{1}{3}} \right)\\
= \left( {z_k^2 - 2 \cdot {z_k} + 1} \right),\;k = 1,...,3
\end{array}

</math> (15)

[[File:shy 345 wiki fig 03.png]]

Figure 3. Comparison of Radau collocation solution with exact solution

The results are given as following by solving the above equations:

<math>

\left\{ {\begin{array}{*{20}{c}}
{{z_0} = - 3}\\
{{z_1} = - {\rm{1.65701}}}\\
{{z_2} = {\rm{0.032053}}}\\
{{z_3} = {\rm{0.207272}}}
\end{array}} \right.

</math> (16)

As shown in Figure 3.2 the error <math>\left\| {z\left( 1 \right) - {z^K}\left( 1 \right)} \right\|</math>， is less than <math>10^-6</math> for <math>N=5</math> and converges with <math>O(h^5)</math>, which is consistent with the expected order <math>2K-1</math>.

(This example follows the work of Biegler and can be found in P293 of “ Nonlinear Programmng”.)

==Conclusion==

In this work, we mainly discussed simultaneous collocation approach for dynamic optimization problems, which formulated the differential equations to a set of algebraic equations. These direct transcription formulations depended on fully discretizing of the differential algebraic equations (DAE), which enabled us solve the simultaneous optimization problem without relying on embedded DAE solvers. Because of this simultaneous formulation, we got the exact first and second order derivatives through the optimization modeling system, and both structure and sparsity can be exploited.

==References==

1. Biegler, Lorenz T. Nonlinear programming: concepts, algorithms, and applications to chemical processes. Vol. 10. SIAM, 2010.

2. Chu, Yunfei, and Fengqi You. "Integration of scheduling and control with online closed-loop implementation: Fast computational strategy and large-scale global optimization algorithm." Computers & Chemical Engineering 47 (2012): 248-268.

3. http://en.wikipedia.org/wiki/Dynamic_programming

4. http://en.wikipedia.org/wiki/Differential_algebraic_equation

5. http://numero.cheme.cmu.edu/uploads/dynopt.pdf

Spatial branch and bound method

2022-04-01T15:33:47Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Spatial_branch_and_bound_method

Authors: Ellen Zhuang (ChE 345 Spring 2015)
 
Steward: Dajun Yue, Fengqi You
 
 
'''Spatial branch-and-bound''' is a divide-and-conquer technique used to find the deterministic solution of global optimization problems.1 It is a type of branch-and-bound method, which solves for the set of parameters that globally optimize the objective function, whether that be finding the minimum or maximum value of <math>f(x)</math> or <math>-f(x)</math>, respectively, where x exists over a set of candidate solutions in the feasible region of the problem.

==Introduction==
Branch-and-bound is the systematic enumeration of possible solutions by iteratively searching the space of the problem. The problem’s candidate solutions form a rooted tree. The algorithm analyzes the nodes, or the subsets of the solution set. This step is known as ''branching''. ''Bounding'' is the estimation of a lower and upper bound for the optimal solution at each node. The possible solution at each branch is checked against these bounds, and if it does not produce a better solution than the current best solution found, the solution is discarded in a step called ''pruning''. This algorithm is repeated until the optimal solution is found. The method was first proposed by A. H. Land and A. G. Doig in 1960 for discrete programming, and its first applications were for discrete problems like the traveling salesman problem.

Spatial branch-and-bound techniques solve mixed integer nonlinear programs (MINLP) with nonconvex terms. Unlike ordinary branch-and-bound methods for mixed integer linear programs (MILP), the spatial branch-and-bound method decomposes the nonlinear functions of the original problem symbolically and recursively with simple operators into simple functions.3

{| border="1" class="wikitable"
|+ Comparison of branch-and-bound for MILP and spatial branch-and-bound for nonconvex MINLP.
!
! Branch-and-bound for MILP
! Spatial branch-and-bound for MINLP
|-
! Relaxation
| replace integer variables with bounded continuous variables to form LP || replace nonconvex terms with their convex envelopes to form convex (MI)NLP
|-
! Branching
| on integer variables
| on integer and continuous variables
|-
! Complexity
| NP-hard || NP-hard (but harder)
|-
|}

Spatial branch-and-bound is used to solve global optimization problems of the form:
 

<math>min f(x)</math>

<math>s.t.</math> <math>g_i(x) \le 0</math> where <math> i = 1,2,...,n</math>

<math> x \in S</math>
 
where S is the set of candidate solutions and is a union of the subsets <math>S_j</math> where <math>j = 1,2,...,n</math>. All functions <math>f(x)</math> and <math>g_i(x)</math> are at least continuously differentiable on some open set containing the subregion <math>S_j</math> with bounds <math>s_j^L \le x \le s_j^U</math>, where <math>s_j^L</math> and <math>s_j^U</math> are the lower and upper bounds of <math>S_j</math>, respectively. The feasible space of the problem is where <math>x \in S</math>
and <math>g_i(x) \le 0</math>. The goal is to find the point <math>x^*</math> in the feasible region where <math>f^*(x) \le f(x) + \epsilon</math>, where <math>\epsilon > 0</math> is a given tolerance.

== Algorithm ==

The main algorithmic idea is to divide the problem into smaller subproblems by branching the variables into possible values, creating upper and lower bounds of the function in the certain domain, and finding sequences of approximate solutions which converge to the actual solution. The smaller subproblems are easier to solve than the original problem.2

Convex relaxation is performed on the original nonconvex problem and underestimates the optimal value of the original function. The original problem is then allocated in the root node of the tree. With each iteration, the original problem is restricted , and the subregions are relaxed to make convex.1 For each subregion, the lower and upper bounds of the optimal value of the objective function is found. If the bounds converge, then the global optimum value to the subregion is found. If the bounds do not meet the desired tolerance, the problem is partitioned into smaller subproblems (descendants) by dividing the feasible space of continuous variable (branching variable). The two subproblems are solved by recursive partitioning. If the subproblem does not have the global optimal solution, then it is pruned. The process is continued until the global optimum value of the original problem is found.2

===Convergence Proof===
A branch and bound method using an exact selection rule will converge. Let M be the space of the sample. M is split at some iteration because it consists of an unqualified region that does not include candidate solutions that give the objective function more optimal solutions than the current incumbent solution. Thus, every qualified subregion of Mn contains the optimal solution.1

It is noted that the spacial branch and bound algorithm does not guarantee convergence in a finite number of steps. The algorithm here uses the concept of ɛ-optimality instead of the usual idea of optimality. A solution x* is a global optimum if f(x*) < f(x) for x in the problem space. Because ɛ is positive, xɛ in the problem space is ɛ-globally optimal if the optimum is bounded by m ≤ f(x*) ≤ M where f(xɛ) is in [m,M] and M - m < ε. Finding the ε-global optimum solution ensures a finite termination than a usual global optimum solution.1

Convergence is accelerated with fathoming. For each subregion, the lower bounding solution and the corresponding lower bounding objective function value are found. The region with the lowest objective function value is usually selected. An upper bound to the problem bounds the problem from above. It discards any region where the lower bound of the objective function exceeds the best upper bound. The discarded regions are fathomed.1
 
[[File:sBBpic.png|800px]]

===Step 1: Initialization===
[[File:sBB_flowchart.png|500px|thumb|right|]]

Initialize the subregions Sj to a single region S encompassing the set of all variables and their ranges. Set the convergence tolerance, ɛ > 0, and the best solution U = ∞ and the corresponding solution -∞ < x* < ∞.

===Step 2: Choice of Region===

A subregion Sj of the problem is chosen.

Bound tightening may be performed during the first two steps. For the initialization step, optimization-based bounds tightening involves finding the smallest range of each variable subject to convex relaxation that will still make the problem feasible. This bounds tightening, however, is computationally expensive because 2n convex NLP or linear convex relation problems have to be solved, where n is the number of variables. For the second step, feasibility-based bounds tightening is performed at each region. Here, the variable bounds are tightened with the constraints of the original problem. This type of bounds tightening is computationally less expensive than the former. The spatial branch-and-bound algorithm will still converge without bound tightening, but it may make the process faster.

===Step 3: Lower Bound===

A convex relaxed problem of the original problem in the region chosen is formulated. The problem is reduced to standard form by linearizing the nonlinear terms. The nonlinear terms are replaced by added variables and constraints of type "added variable = nonlinear term". The nonconvex terms are also replaced by convex under- and over- estimators such as convex envelopes. Convex relaxation guarantees a lower bound to the objective function value in that region.

The new problem is solved to give an underestimation l of the objective function with solution xjL. If l > U , the relaxed problem is infeasible and step 2 is repeated.

===Step 4: Upper Bound===

[[File:sBB.png|500px|thumb|right|Since the lower bound (LB) of S2 is greater than the upper bound (UB) of S1, which is the current best solution, so the solution in the region S2 is infeasible and discarded.]]

The original nonconvex problem in the selected region is solved to obtain a local optimum solution xjU with objective function u . The subproblem can be solved numerically or by branching. If the spatial branch-and-bound algorithm has added variables, branching can be done on those. Convex relaxation is done, and the lower and upper bounds are found. These bounds are not dependent on the added variables. Branching can also be done on the original variables. The range of the original problem [xL, xU] is partitioned into [xL, x'] and [x', xU]. The upper bounding problem is solved in each subregion.

Algorithms that solve nonconvex NLPs are fragile and can fail even if a local optimum solution exists. If the function is unable to be solved, set u = ∞ and -∞ < xjU < ∞.

===Step 5: Pruning===

If U > u , set x* = xU and U = u . All the regions with lower bounds greater than U are discarded.

===Step 6: Check Region===

If u - l ≤ ɛ, then u is a global minimum of the region, and the algorithm returns to step 2. If not, then the global minimum of the subregion is not found, and the algorithm proceeds to the next step.

===Step 7: Branching===

The region is split into sub-regions, and the algorithm returns to step 2. There are several strategies for branching, and most consist of determining the set of variables on which to branch and the variables whose domain is sub-divided by the branching operation.

==Example1==
[[File:sBBexample.png|200px|thumb|right|The global optimization problem min f(x) = 1/4 x + sin(x).1]]

 
<math>min</math> <math>f(x) </math> <math>= 1/4 x + </math><math>sin(x)</math>

<math>s.t.</math> <math>-3 \le x \le 6</math>
 

The above NLP has two minima, one of which is global. Using the spatial branch-and-bound approach, the space of the problem is divided into subregions, within which the lower and upper bounds of the optimum is found. If the bounds are not within a given <math>\epsilon < 0</math> tolerance (that is, the bounds are not sufficiently close), then the subregion is partitioned and the process is repeated. In this example, let <math>\epsilon = 0.15</math>.

===First Iteration===
Consider the region that encompasses the entire range <math>-3 \le x \le 6</math>. The objective function is underestimated with the convex underestimator <math>1/4 x - 1</math>. To tighten the bounds, tangents to <math>f(x)</math> are added:
<center> 
<math>-0.74x - 3.11</math> at <math>x = -3</math>
 
<math>1.21x - 6.04</math> at <math>x = 6</math>. </center>
 The convex estimator is then
<math>f^c(x) = \{-0.74x - 3.11, 1/4 x - 1, 1.21x - 6.04\}</math>.

The minimum of the convex problem is found to be <math>l = -1.53 </math> at <math>x^L=-2.13</math>.

Next, the upper bound of the problem is solved locally using Newton's method with <math>x = 6</math> as the starting point. The upper bound is found to be <math>u = 0.147</math> at <math>x^U = 4.46</math>. <math>u</math> and <math>l</math> are not sufficiently close since <math>u - l = 1.67 >1 </math>, so it is not certain that <math>x^U</math> is a global optimum of the region.

The region is partitioned into two subregions, using <math>x^U = 4.46</math> as the branching point. The two unexamined regions are now <math>-3 \le x \le 4.46</math> and <math>4.46 \le x \le 6</math>.

[[File:sBBexample2.png|200px|thumb|center|The convex relaxation and lower bound of the first region examined.1]]

===Second Iteration===

Here, let <math>-3 \le x \le 4.46</math> be the first region to be examined. The lower and upper bounds are found to be <math>l = -1.53</math> (as before) and <math>u = -1.43</math> at <math>x^U = -1.823</math>. Now, <math>u - l = 0.11 < \epsilon</math>, so <math>x^U = -1.823</math> is accepted as the global optimum of the region analyzed. The best solution found thus far is now <math>x^* = -1.823</math> and <math>U = -1.42</math>. Since a global optimum exists, the region does not need to be branched.

[[File:sBBexample3.png|200px|thumb|center|The second iteration.1]]

===Third Iteration===

The region <math>4.46 \le x \le 6</math> is examined. The lower and upper bounds are computed to be <math>l = 1.11</math> at <math>x^L = 4.46</math> and <math>u = 1.147</math> at <math>x^U = 4.46</math>. Since <math>u - l = 0.04 < \epsilon</math>, the solution found is the optimum of the subregion, and branching is not necessary. The best solution found thus far does not need to be updated since <math>U < u</math>, so this region is discarded. The algorithm converges at the global optimum <math>x^* = -1.823</math>.

[[File:sBBexample4.png|200px|thumb|center|The third iteration.1]]

This example illustrates important features of the spatial branch-and-bound approach. First, the convex relaxation can be made tighter in each region. Second, in problems with multiple variables, the choice of the variable on which to branch is important. Third, the subregions are pruned when its lower bound is higher than the best solution found thus far.

==Applications==
Spatial branch-and-bound is used to solve a number of problems, including
*Pooling and Blending Problems: An example of a pooling/blending problem is the mixing of various types of crude oils with different qualities from different feeds. These streams are mixed in various pools. The parameters of the mixing are set to minimize cost while meeting quality requirement and quantity demand.
*Euclidean Location Problems: An example of this problem is finding the geographical positions of a plant that minimizes transportation costs.
*Kissing Number Problem: The objective of the kissing number problem is to find the maximum number of non-overlapping n-dimensional spheres in (n+1)-dimensional Euclidian space that can be arranged such that each is touching another. In 1-dimension, the kissing number is 2; in 2-dimension, 6; and in 3-dimension, 12.

==Conclusion==

Accurate models of real problems often involve nonconvex terms in either the objective function or the constraints. Nonconvex problems are one of the hardest to solve. Since there are multiple local optima solutions, it is difficult to find which solution is the best. Global optimization approaches, such as spatial branch-and-bound, are used to solve such problems. Other algorithms have been derived from the original spatial branch-and-bound. Examples include branch-and-reduce, α branch-and-bound, symbolic reformulation, branch-and-contract, and reduced spatial branch-and-bound.

==References==

#Liberti, Leo. "Introduction to global optimization." Lecture of Ecole Polytechnique, Palaiseau F 91128 (2008): 12.
#Pozo, C., et al. "A spatial branch-and-bound framework for the global optimization of kinetic models of metabolic networks." Industrial & Engineering Chemistry Research 50.9 (2010): 5225-5238.
#Stein, Oliver, Peter Kirst, and Paul Steuermann. "An Enhanced Spatial Branch-and-Bound Method in Global Optimization with Nonconvex Constraints." (2013).

Logarithmic transformation

2022-04-01T15:33:04Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Logarithmic_transformation

Author: Hassan Ali (ChE 345 Spring 2015)

Steward: Dajun Yue, Fengqi You

'''Logarithmic transformation''' is a method used to change geometric programs into their convex forms. A '''geometric program''', or '''GP''', is a type of global optimization problem that concerns minimizing a subject to constraint functions so as to allow one to solve unique non-linear programming problems. All geometric programs contain functions called posynomials that are inherently non-convex. Due to this fact, solving geometric problems can be computationally intensive and finding a global optimum solution is not guaranteed. However, by creating a logarithmic transformation for a problem, one can solve for the globally optimum solution quicker and easier. A logarithmic transformation is not the only transformation which allows. One can also use an exponential transformation to obtain the same result. A logarithmic transformation can also be used on signomial programs, which are an extension of geometric programs.1,2,3

==Background==

===Geometric Programming===

Geometric programming was first discussed in the early 1960's as a class of optimization problem that could be solved with geometric inequalities. Duffin and his colleagues went on to formulate basic theories of geometric programming and its applications in 1967 in their seminal textbook ''Geometric Programming Theory and Application''.4 One of their insights into geometric programming was noticing that problems with highly non-linear constraints could be stated equivalently with a '''dual program'''. In other words, a global minimizing solution could be found by solving its corresponding dual maximization problem. This is because the dual constraints are linear. All one would need to do is change the geometric programming problem from its standard posynomial form into this "dual form" and solve using these now linear constraints.3 The answer could only be locally optimum as one was still dealing with '''non-linear programming (NLP)''' methods.

A standard GP problem takes the following form

:<math> \min ~f_0(x)</math>
:<math>s.t. ~~ f_i(x) \le 1, i = 1,...,m </math>
:::<math>h_j(x)=1, j=1,...,M</math>

where <math> f_0,...,f_m</math> are posynomials and <math> h_1,...,h_M</math> are monomials, and the variables '''x''' are all positive.

===Posynomials===

'''Posynomials''' are functions that are non-linear and consist of a set of constants multiplied to a series of multiple variables multiplied together. Often these variables are raised to various powers. More specifically, they are functions of the form:

<math> f(x_1,x_2,...,x_n)=\sum_{k=1}^K c_kx_1^{a_{1k}}...x_n^{a_{nk}} </math>

where the variables <math> x_i </math> and the coefficients <math> c_k </math> are positive, real numbers, and all of the exponents <math> a_{ik} </math> are real numbers.4

'''Monomials''' are a specific case of posynomial where there is no set of terms but rather only one term and its constant. A posynomial is then simply a sum of monomials. An example of a monomial is

<math> f(x_1,x_2)=400x_1^{3}x_2^{-2} </math>

and an example of a posynomial is

<math> f(x_1,x_2)=4x_1^{2}x_2^{4}+x_1^{3}x_2^{5} </math>

A posynomial is therefore not of the form

<math> f(x_1,x_2)=x_1-x_2 </math>

as this is a linear function.

===Geometric Programming Derivation===

GP problems are not always given and sometimes must be derived.5 Take the following example of a standard NLP problem

:<math> \max ~x/y</math>
:<math> s.t. ~~2\le x\le 3</math>
:::<math> ~x^2 + 3y/z\le \sqrt{y} </math>
:::<math> ~x/y = z^2 </math>

where the variables '''x,y, and z''' are positive. The standard GP form of this problem would be

:<math> \max ~x^{-y}</math>
:<math> s.t. ~~2x^-1\le 1, (1/3)x\le 1 </math>
:::<math> ~x^{2}y^{-0.5} + 3y^{0.5}z^{-1}\le 1 </math>
:::<math> ~xy^{-1}z^{-2} = 1 </math>

A standard GP problem then minimizes a posynomial function while constraining it to posynomial inequality constraints and monomial equality constraints. Posynomials are positive functions and contain log convexity. This is an important facet as it allows standard GP problems to undergo log transformations.1

==Logarithmic Transformation==

The Duffin "dual-program" method for solving GP problems is still in use but as logarithmic and exponential transformation methods were understood it became easier to simply use them to change the standard GP problem into a convex GP problem and solve using an interior point method. '''Interior point methods''' can solve GP problems very fast and robust as they require essentially no tuning of the parameters. Most importantly, the final solution obtained through this method is guaranteed to be the global optimum solution. Solving a standard GP problem is like solving an NLP problem. The only difference being that a GP problem is more constrained in its use of posynomials and monomials in its objective function and constraints. This makes standard GP problems more efficient to solve and they can be solved relatively quickly, but like NLP problems there is no guarantee the solution is globally optimum. In addition, an initial guess needs to be provided and parameters must be carefully chosen.5

Logarithmic transformations are therefore very helpful as they transform the standard GP form into its convex form based on a logarithmic change of variables and a logarithmic transformation of the objective and constraint functions. So instead of minimizing the objective <math>f_0</math>, the logarithm of <math>f_0</math> is minimized. The variable '''x''' is replaced with the its logarithm <math>y = \log x</math>. So now <math>x = e^y</math>. The inequality constraint is now <math>\log f_i \le 0</math> instead of <math>f_i \le 0</math> and the equality constraint is now <math>\log h_j=0</math> instead of <math>h_j=0</math>.5

Therefore the convex form of a GP problem is

:<math> \min ~\log f_0(e^y)</math>
:<math>s.t. ~~ \log f_i(e^y) \le 0, i = 1,...,m </math>
:::<math>\log h_j(e^y)=0, j=1,...,M</math>

where <math> f_0,...,f_m</math> are posynomials and <math> h_1,...,h_M</math> are monomials, and the variables '''y''' are all positive. This reformulation can occur as long as the constants in the function are positive.

===Methods===

[[File:Lse_plot.png|thumb|right|upright=1.7|The log-sum-exp function in <math>R^2</math>.]]

To see the transformation clearer, more '''detailed steps''' can be shown. If the objective function is a '''monomial''' of the format

<math>f_0(x) = cx_1^{a_1}x_2^{a_2}...x_n^{a_n}</math>

then the new objective function will be a function of a new variable '''y''' where <math>y = \log x</math> and <math>x = e^y</math>. It will also be a logarithm of the original objective function so

<math>f(y) = \log f_0(e^y)</math>

where

<math>\log f_0(e^y) = \log c +a_1\log x_1 +...+ a_n\log x_n = \log c +a_1y_1 +...+ a_ny_n</math>

which is an affine function of '''y'''

If the above objective function was instead a monomial equality constraint such that

<math>g(x) = cx_1^{a_1}x_2^{a_2}...x_n^{a_n}=1</math>

then the new equality constraint would be equal to zero and be simplified to a linear equation of the form

<math>-\log c = a_1y_1 +...+ a_ny_n</math>

Any GP with only monomials reduces to a linear program after a logarithmic transformation. Checking for linearity in a monomial case will confirm if the transformation was done correctly.5

If the GP has '''posynomials''' the problem becomes more complex. The log of multiple terms cannot be simplified beyond its logarithm form. So if the objective function is

<math>f_0(x) = \sum_{n=1}^N c_nx_n^{a_n}</math>

where '''c''' > 0, then the new objective function will be

<math>f(y) = \log f_0(e^y)</math>

where

<math>\log f_0(e^y) = \log \left ( \sum_{n=1}^N e^{a_ny_n+b_n} \right ) </math>

and where <math>b_n = \log c_n </math> for <math>n = 1,...,N</math>. The above can be written in a simpler form as

<math>\log f_0(e^y) = lse(Ay+b)</math>

where A is an N by n matrix with rows <math>a_1,...,a_N</math> and lse is the [https://inst.eecs.berkeley.edu/~ee127a/book/login/def_lse_fcn.html log-sum-exp function] which is convex.6

The reason the logarithmic transformation is done when an '''exponential transformation''' can be done in less steps is that the exponential function might take in large values. This may create numerical problems as well as lead to complications in optimization programs. Taking a logarithm allows for the recovery of smaller values, which gives the logarithmic transformation a distinct advantage.6

===Examples===
A few objective functions and constraints are provided in this section as illustrative examples.

'''EXAMPLE 1'''

:<math> \min ~f_0(x_1,x_2)=4x_1^{2}x_2^{4}</math>
:<math>s.t. ~~ x_1 \ge 0 </math>
:::<math> x_2 \ge 0 </math>

After a logarithmic transformation, this GP becomes:
:<math> \min ~f(y_1, y_2)= \log 4 + 2y_1 + 4y_2 </math>
:<math>s.t. ~~ y_1 \ge 0 </math>
:::<math> y_2 \ge 0 </math>

'''EXAMPLE 2'''

:<math> \min ~f_0(x_1,x_2)=x_1^{2}x_2 + x_1x_2</math>
:<math>s.t. ~~ x_1^{2}x_2^{-2} \ge 1 </math>

After a logarithmic transformation, this GP becomes:
:<math> \min ~f(y_1, y_2)= \log \left ( \sum_{n=1}^2 e^{a_ny_n} \right ) </math>
:<math>s.t. ~~ 2y_1 - 2y_2 \ge 0 </math>

'''EXAMPLE 3'''

:<math> \min ~f_0(x_1,x_2)=16x_1^{2}x_2^{5}</math>
:<math>s.t. ~~ \frac{x_1}{x_2} = 1 </math>

After a logarithmic transformation, this GP becomes:
:<math> \min ~f(y_1, y_2)= \log 16 + 2y_1 + 5y_2 </math>
:<math>s.t. ~~ y_1=y_2 </math>

To prove that the answers to the above examples are indeed convex, one could verify its convexity through a positive-definiteness test of the [http://en.wikipedia.org/wiki/Hessian_matrix/ Hessian]. The reformulations can also be tested in GAMS.

==Feasibility Analysis==

A standard GP problem can be infeasible i.e. the constraints are too "tight" and don't allow for a solution. Because the constraints from a standard GP problem hold over even when changed to its convex form via logarithmic transformation, an infeasible GP problem will always be infeasible regardless of convexity. This is why it is important to check for infeasibility prior to creating a linear transformation. This is done by undergoing a feasibility analysis and setting up the GP as follows:

:<math> \min ~s </math>
:<math>s.t. ~~f_i(x) \le s, i = 1,...,m</math>
:::<math>g_i(x)=1, i=1,...,p</math>
:::<math> s \ge 1</math>

As s nears a value of 1, the original problem nears feasibility. The final goal is to find a solution such that s=1 so that feasibility of the problem is confirmed.1

==Applications==

Applications of geometric programming include: 

<OL>

<LI>'''Electrical Engineering'''5
<UL>
<LI>Power Control
<LI>Wire Sizing
<LI>Routing
<LI>Optimal doping profile in semiconductor device engineering
<LI>Digital circuit gate sizing
</UL>
<LI>'''Chemical Engineering'''5
<UL>
<LI>Optimal reactor design
<LI>Mass Transfer Optimization
<LI>Kinetics
<LI>Maximizing reliability of reactors
</UL>
<LI>Other
<UL>
<LI>Transportation6
<LI>Economic Models
<LI>Inventory Models
</UL>
</OL>

A great example among these varied applications concerns optimal reactor design. The chemical systems within a reactor follow non convex kinetics equations. If one wanted to optimize a reactor for a given system with given reaction rates, one could do so easily with a logarithmic transformation. Take a reaction '''A + B to C''' with reaction rate <math>r = kC_AC_B</math>. The logarithmic transformation that would allow one to obtain a convex relation is <math>r^*=\log k + c_A +c_B</math> where <math>c = \log C</math>. With this new equation, you can optimize your reactor design for your given system.

== Conclusion ==

Geometric Programming is an application that can solve a varying degree of difficult optimization problems but due to the nature of their complexity, these problems cannot be solved easily. Solving geometric problems can be computationally intensive and finding a global optimum solution is not guaranteed. This is due to the fact that GPs involve posynomials and monomials which are by nature non-linear. However, by creating a '''logarithmic transformation''' for a problem, one can solve for the globally optimum solution quicker and easier because the GP is changed to its convex form. Logarithmic transformations offer an advantage over other transformations as it allows one to work with smaller values which are less problematic. Logarithmic transformations can be used wherever geometric programming is used which includes applications in chemical engineering and economics.

== References ==

# Chiang, M. (2005). ''Geometric Programming for Communication Systems'', Publishers, Inc., ISBN 1-933019-09-3.
# Duffin, R.J. (1970). "Linearizing Geometric Programs", ''SIAM Review'' '''12''' (2).
# Biswal, K.K., Ohja, A.K. (2010). "Posynomial Geometric Programming Problems with Multiple Parameters", ''Journal of Computing'' '''2''' (1).
# Duffin, R.J., Peterson, E.L., Zener, C.M. (1967). ''Geometric Programming Theory and Application'', John Wiley & Sons Inc., ISBN 978-0471223702.
# Boyd, S., Hassibi, A., Kim, S.J., Vandenberghe, L. (2007). "A Tutorial on Geometric Programming", ''Optim Eng'' '''8''': 67-127.
# Calafiore, G.C., El Ghaoui, L. (2014). ''Optimization Models and Applications'', Cambridge University Press., ISBN 978-1107050877.

File:Matt4.png

2022-04-01T15:31:24Z

Btantisujjatham:

Matt4

File:Matt3.jpg

2022-04-01T15:31:18Z

Btantisujjatham:

Matt3

File:Matt2.jpg

2022-04-01T15:31:11Z

Btantisujjatham:

Matt2

Signomial problems

2022-04-01T15:30:53Z

Btantisujjatham: Created page with "Author: Matthew Hantzmon (Che_345 Spring 2015) =Introduction= Sigmoid problems are a class of optimization problems with the objective of maximizing the sum of multiple sigmo..."

Author: Matthew Hantzmon (Che_345 Spring 2015)

=Introduction=
Sigmoid problems are a class of optimization problems with the objective of maximizing the sum of multiple sigmoid functions. They are defined by their limits at negative and positive infinity.
Similar to the unit step function the function approaches 1 as it approaches infinity and approaches -1 as it approaches negative infinity. Sigmoid functions are shaped like an “S”, having both a convex and concave portion. The concave attributes will make solving the problems require a non-linear optimization, increasing computational burden. Though sigmoid problems are harder to solve than ordinary convex programs, they have many useful real-world applications which have encouraged their development. Sigmoid problems have become especially useful in the creation of artificial neural networks that simulate learning. The functionality of sigmoid problems allow in statistical models and other artificial learning.

The S-shape of the individual sigmoid functions made them good representatives of economies of scale, or other situations where the supplier is facing a declining demand function and increased investment leads to a lower average cost to produce each good. Within the sigmoidal problem, the sigmoidal functions within the objective represent that initial increases in investment will increase profitability when there is excess demand at the current price. The inflection point in the sigmoidal functions represents the point where the marginal profit for producing another good becomes 0. At this point the decrease in marginal cost from investing in more production is matched by an equal decrease in willingness to pay by the marginal consumer. Past this point the function determines that increased investment will not increase profitability. These properties of the sigmoid functions allow them to be ideal representatives of situations where initial investment is profitable though a threshold is reached where no additional inputs will be profitable. Other situations that exhibit these characteristics and can be modeled with sigmoidal problems include election planning, lottery prize design, and bidding at an auction. Election planners always desire to find the locations where spending on advertising will have the greatest effect on election results. When designing a lotter, the company wants to design a prize value that encourages many people to buy tickets while still being net profitable for the lottery. At an auction, a bidder may not have enough capital to buy every item they may desire so it is important early on to not waste a disproportionate amount of money winning an item that is not important. A sigmoidal problem maximizing utility can help determine the threshold value to bid on each item.

=Sigmoidal Functions=
[[File:matt4.png|thumb|right|350px|Figure 1: Logistic Sigmoid Function]]
Generally, a sigmoid function is any function having an “S” shape. One portion of the function will be convex while the other portion will be concave. The most common example of the sigmoid function is the logistic function shown in figure 1. However, the definition of sigmoidal functions is very broad: Any function that has real values, one inflection point and has a bell shaped first derivative function. With this definition, all logistic functions, the error function, the cumulative distribution function and many more can be considered sigmoidal functions and may be included in the objective function of a sigmoidal problem.
In figure 1, the coefficient 1/c is the temperature constant. As the value of c increases in the sigmoid function, the sigmoid function will approach the shape of a step function.

=Sigmoidal Problems=
[[File:matt2.jpg|thumb|right|350px|Figure 2: "S" Shape Curve]]
In practice, sigmoid problems consist of an optimization where the objective function is the sum of a set of sigmoid functions. Each of the sigmoid functions only has one input variable that determines the output value. Unless a certain threshold of input value has been exceeded, usually the value of the sigmoid function will be very close to zero. As a result, the solution of a sigmoid problem will consist of all the function input values either being 0 or a number exceeding the sigmoid function threshold. This threshold is depicted as being near z in figure 2. When optimizing inputs to achieve a desired output, usually there is a scarcity of inputs that must be allocated optimally; otherwise there would be no cause for a problem. Thus, the solution of the optimization will exceed the threshold z in as many indexes as possible. Any indexes that were not able to reach z input usually would have an input of 0. This is because, below z total inputs, two functions each with one unit of input will always have a lower combined utility than one sigmoid function with two units of input.

==Applications of Sigmoidal Problems==
These characteristics of sigmoidal problems make them very useful for certain types of optimizations. In presidential election planning, it is always important to spend campaign advertising dollars in a way which will have the largest impact on electoral count. Since the format of our election is winner-take-all in each state, the states that have the tightest poll results would be the states where the smallest change in poll numbers would most greatly change the overall electoral count. If a campaign team were trying to optimize their advertising spending they would want to create an objective that illustrated how their advertising would have an effect on each state. The campaign would only want to spend money on states that have a feasible change of voting in favor of their candidate. Money spend on states eventually lost is completely wasted. Sigmoidal functions will behave in the same way, only recommending to spend money in a state if they can pass the threshold of getting a majority of voters. They will predict that it is better to spend a lot of money on one state they have a good chance to influence than spend half each on two states where they may end up losing both.

The same logic follows for why sigmoidal problems are useful for optimizing bidding within an auction. Utility is only gained when a desired item is won, so it is always important to set up a function that only calculates utility when the necessary bidding threshold has been passed. There is no use or utility gained from bidding half the value on every item, and sigmoidal functions would appropriately capture this fact.

Sigmoid functions can be very useful in situations where utility gained is not continuous based on proportion but is more of a binary either/or result. They are not as extreme of a binary result as a step function, and the fact that they are fully differentiable makes them much more easily solved. However, since they have concave portions, their solution is not as simple as a linear program and may require software or approximations.

Large arrays of sigmoid problems have also been combined to simulate an artificial neural networks. These neural networks are sets of sigmoid functions used to approximate interconnected neurons, which are very powerful in statistical modeling. The neural networks are created by the aggregation of sigmoidal functions using the back-propagation algorithm. The most common applications of the sigmoidal problems is the density problem. The density of a linear space with uniform convergence can be calculated for a compact set using Cybenko's approximation theorem.

==Solving Sigmoidal Problems==
[[File:matt3.jpg|thumb|right|350px|Figure 3: Concave Envelope]]
As one of the properties of a sigmoid function is concave portions, simple simplex optimization will not suffice. As computation power has increased in recent decades, solutions of sigmoidal problems has become much more feasible using software available online. Though, solutions with the appropriate software are fairly straightforward and simple, these sorts of problems can be highly difficult to solve by hand. They can rarely be solved at all in the original format and frequently need several creative approximations to leave a problem that is computable by hand.
One method that allows simpler by-hand computation is the creation of a convex envelope. Depicted in figure 3, the concave envelope is an approximation of the sigmoidal function that adds a linear portion where the function had been previously convex. This addition, greatly simplifies the function’s computational difficulty. However, it must be noticed, that this approximation will only give an upper bound on the solution as it is including area that it not part of the original constraints.
Even with function simplification and approximation, usually there is a large set of functions to be optimized. Depending on the level of approximation, the solution found may not even contain a relevant level of accuracy. In general, software will be the method of choice when dealing with sigmoidal problems.

=References=
1. J. Birge, F. Louveaux, Introduction to Stochastic Programming, Springer, 2011.

2. Udell, Madeline, and Stephen Boyd. "Maximizing a Sum of Sigmoids." Stanford University, May 2015. Web. May 2015. <http%3A%2F%2Fweb.stanford.edu%2F~udell%2Fdoc%2Fmax_sum_sigmoids>.

3. Srivastava, Vaibhav, and Francesco Bullo. "Knapsack Problems with Sigmoid Utilities." Princeton, July 2013. Web. May 2015. <http%3A%2F%2Fwww.princeton.edu%2F~vaibhavs%2Fpapers%2F2012d-sb>.

4. Rojas, R. "Backpropagation." SpringerReference (2011): n. pag. UBerlin. UBerlin. Web.

File:SBBpic.png

2022-04-01T15:29:23Z

Btantisujjatham:

SBBpic

File:SBBexample4.png

2022-04-01T15:29:16Z

Btantisujjatham:

SBBexample4

File:SBBexample3.png

2022-04-01T15:29:08Z

Btantisujjatham:

SBBexample3

File:SBBexample2.png

2022-04-01T15:29:00Z

Btantisujjatham:

SBBexample2

File:SBBexample.png

2022-04-01T15:28:51Z

Btantisujjatham:

SBBexample

File:SBB flowchart.png

2022-04-01T15:28:44Z

Btantisujjatham:

SBB flowchart

File:SBB.png

2022-04-01T15:28:35Z

Btantisujjatham:

SBB

Spatial branch and bound method

2022-04-01T15:27:20Z

Btantisujjatham: Created page with "Authors: Ellen Zhuang (ChE 345 Spring 2015) Steward: Dajun Yue, Fengqi You '''Spatial branch-and-bound''' is a divide-and-conquer technique used to find the det..."

Authors: Ellen Zhuang (ChE 345 Spring 2015)
 
Steward: Dajun Yue, Fengqi You
 
 
'''Spatial branch-and-bound''' is a divide-and-conquer technique used to find the deterministic solution of global optimization problems.1 It is a type of branch-and-bound method, which solves for the set of parameters that globally optimize the objective function, whether that be finding the minimum or maximum value of <math>f(x)</math> or <math>-f(x)</math>, respectively, where x exists over a set of candidate solutions in the feasible region of the problem.

==Introduction==
Branch-and-bound is the systematic enumeration of possible solutions by iteratively searching the space of the problem. The problem’s candidate solutions form a rooted tree. The algorithm analyzes the nodes, or the subsets of the solution set. This step is known as ''branching''. ''Bounding'' is the estimation of a lower and upper bound for the optimal solution at each node. The possible solution at each branch is checked against these bounds, and if it does not produce a better solution than the current best solution found, the solution is discarded in a step called ''pruning''. This algorithm is repeated until the optimal solution is found. The method was first proposed by A. H. Land and A. G. Doig in 1960 for discrete programming, and its first applications were for discrete problems like the traveling salesman problem.

Spatial branch-and-bound techniques solve mixed integer nonlinear programs (MINLP) with nonconvex terms. Unlike ordinary branch-and-bound methods for mixed integer linear programs (MILP), the spatial branch-and-bound method decomposes the nonlinear functions of the original problem symbolically and recursively with simple operators into simple functions.3

{| border="1" class="wikitable"
|+ Comparison of branch-and-bound for MILP and spatial branch-and-bound for nonconvex MINLP.
!
! Branch-and-bound for MILP
! Spatial branch-and-bound for MINLP
|-
! Relaxation
| replace integer variables with bounded continuous variables to form LP || replace nonconvex terms with their convex envelopes to form convex (MI)NLP
|-
! Branching
| on integer variables
| on integer and continuous variables
|-
! Complexity
| NP-hard || NP-hard (but harder)
|-
|}

Spatial branch-and-bound is used to solve global optimization problems of the form:
 

<math>min f(x)</math>

<math>s.t.</math> <math>g_i(x) \le 0</math> where <math> i = 1,2,...,n</math>

<math> x \in S</math>
 
where S is the set of candidate solutions and is a union of the subsets <math>S_j</math> where <math>j = 1,2,...,n</math>. All functions <math>f(x)</math> and <math>g_i(x)</math> are at least continuously differentiable on some open set containing the subregion <math>S_j</math> with bounds <math>s_j^L \le x \le s_j^U</math>, where <math>s_j^L</math> and <math>s_j^U</math> are the lower and upper bounds of <math>S_j</math>, respectively. The feasible space of the problem is where <math>x \in S</math>
and <math>g_i(x) \le 0</math>. The goal is to find the point <math>x^*</math> in the feasible region where <math>f^*(x) \le f(x) + \epsilon</math>, where <math>\epsilon > 0</math> is a given tolerance.

== Algorithm ==

The main algorithmic idea is to divide the problem into smaller subproblems by branching the variables into possible values, creating upper and lower bounds of the function in the certain domain, and finding sequences of approximate solutions which converge to the actual solution. The smaller subproblems are easier to solve than the original problem.2

Convex relaxation is performed on the original nonconvex problem and underestimates the optimal value of the original function. The original problem is then allocated in the root node of the tree. With each iteration, the original problem is restricted , and the subregions are relaxed to make convex.1 For each subregion, the lower and upper bounds of the optimal value of the objective function is found. If the bounds converge, then the global optimum value to the subregion is found. If the bounds do not meet the desired tolerance, the problem is partitioned into smaller subproblems (descendants) by dividing the feasible space of continuous variable (branching variable). The two subproblems are solved by recursive partitioning. If the subproblem does not have the global optimal solution, then it is pruned. The process is continued until the global optimum value of the original problem is found.2

===Convergence Proof===
A branch and bound method using an exact selection rule will converge. Let M be the space of the sample. M is split at some iteration because it consists of an unqualified region that does not include candidate solutions that give the objective function more optimal solutions than the current incumbent solution. Thus, every qualified subregion of Mn contains the optimal solution.1

It is noted that the spacial branch and bound algorithm does not guarantee convergence in a finite number of steps. The algorithm here uses the concept of ɛ-optimality instead of the usual idea of optimality. A solution x* is a global optimum if f(x*) < f(x) for x in the problem space. Because ɛ is positive, xɛ in the problem space is ɛ-globally optimal if the optimum is bounded by m ≤ f(x*) ≤ M where f(xɛ) is in [m,M] and M - m < ε. Finding the ε-global optimum solution ensures a finite termination than a usual global optimum solution.1

Convergence is accelerated with fathoming. For each subregion, the lower bounding solution and the corresponding lower bounding objective function value are found. The region with the lowest objective function value is usually selected. An upper bound to the problem bounds the problem from above. It discards any region where the lower bound of the objective function exceeds the best upper bound. The discarded regions are fathomed.1
 
[[File:sBBpic.png|800px]]

===Step 1: Initialization===
[[File:sBB_flowchart.png|500px|thumb|right|]]

Initialize the subregions Sj to a single region S encompassing the set of all variables and their ranges. Set the convergence tolerance, ɛ > 0, and the best solution U = ∞ and the corresponding solution -∞ < x* < ∞.

===Step 2: Choice of Region===

A subregion Sj of the problem is chosen.

Bound tightening may be performed during the first two steps. For the initialization step, optimization-based bounds tightening involves finding the smallest range of each variable subject to convex relaxation that will still make the problem feasible. This bounds tightening, however, is computationally expensive because 2n convex NLP or linear convex relation problems have to be solved, where n is the number of variables. For the second step, feasibility-based bounds tightening is performed at each region. Here, the variable bounds are tightened with the constraints of the original problem. This type of bounds tightening is computationally less expensive than the former. The spatial branch-and-bound algorithm will still converge without bound tightening, but it may make the process faster.

===Step 3: Lower Bound===

A convex relaxed problem of the original problem in the region chosen is formulated. The problem is reduced to standard form by linearizing the nonlinear terms. The nonlinear terms are replaced by added variables and constraints of type "added variable = nonlinear term". The nonconvex terms are also replaced by convex under- and over- estimators such as convex envelopes. Convex relaxation guarantees a lower bound to the objective function value in that region.

The new problem is solved to give an underestimation l of the objective function with solution xjL. If l > U , the relaxed problem is infeasible and step 2 is repeated.

===Step 4: Upper Bound===

[[File:sBB.png|500px|thumb|right|Since the lower bound (LB) of S2 is greater than the upper bound (UB) of S1, which is the current best solution, so the solution in the region S2 is infeasible and discarded.]]

The original nonconvex problem in the selected region is solved to obtain a local optimum solution xjU with objective function u . The subproblem can be solved numerically or by branching. If the spatial branch-and-bound algorithm has added variables, branching can be done on those. Convex relaxation is done, and the lower and upper bounds are found. These bounds are not dependent on the added variables. Branching can also be done on the original variables. The range of the original problem [xL, xU] is partitioned into [xL, x'] and [x', xU]. The upper bounding problem is solved in each subregion.

Algorithms that solve nonconvex NLPs are fragile and can fail even if a local optimum solution exists. If the function is unable to be solved, set u = ∞ and -∞ < xjU < ∞.

===Step 5: Pruning===

If U > u , set x* = xU and U = u . All the regions with lower bounds greater than U are discarded.

===Step 6: Check Region===

If u - l ≤ ɛ, then u is a global minimum of the region, and the algorithm returns to step 2. If not, then the global minimum of the subregion is not found, and the algorithm proceeds to the next step.

===Step 7: Branching===

The region is split into sub-regions, and the algorithm returns to step 2. There are several strategies for branching, and most consist of determining the set of variables on which to branch and the variables whose domain is sub-divided by the branching operation.

==Example1==
[[File:sBBexample.png|200px|thumb|right|The global optimization problem min f(x) = 1/4 x + sin(x).1]]

 
<math>min</math> <math>f(x) </math> <math>= 1/4 x + </math><math>sin(x)</math>

<math>s.t.</math> <math>-3 \le x \le 6</math>
 

The above NLP has two minima, one of which is global. Using the spatial branch-and-bound approach, the space of the problem is divided into subregions, within which the lower and upper bounds of the optimum is found. If the bounds are not within a given <math>\epsilon < 0</math> tolerance (that is, the bounds are not sufficiently close), then the subregion is partitioned and the process is repeated. In this example, let <math>\epsilon = 0.15</math>.

===First Iteration===
Consider the region that encompasses the entire range <math>-3 \le x \le 6</math>. The objective function is underestimated with the convex underestimator <math>1/4 x - 1</math>. To tighten the bounds, tangents to <math>f(x)</math> are added:
<center> 
<math>-0.74x - 3.11</math> at <math>x = -3</math>
 
<math>1.21x - 6.04</math> at <math>x = 6</math>. </center>
 The convex estimator is then
<math>f^c(x) = \{-0.74x - 3.11, 1/4 x - 1, 1.21x - 6.04\}</math>.

The minimum of the convex problem is found to be <math>l = -1.53 </math> at <math>x^L=-2.13</math>.

Next, the upper bound of the problem is solved locally using Newton's method with <math>x = 6</math> as the starting point. The upper bound is found to be <math>u = 0.147</math> at <math>x^U = 4.46</math>. <math>u</math> and <math>l</math> are not sufficiently close since <math>u - l = 1.67 >1 </math>, so it is not certain that <math>x^U</math> is a global optimum of the region.

The region is partitioned into two subregions, using <math>x^U = 4.46</math> as the branching point. The two unexamined regions are now <math>-3 \le x \le 4.46</math> and <math>4.46 \le x \le 6</math>.

[[File:sBBexample2.png|200px|thumb|center|The convex relaxation and lower bound of the first region examined.1]]

===Second Iteration===

Here, let <math>-3 \le x \le 4.46</math> be the first region to be examined. The lower and upper bounds are found to be <math>l = -1.53</math> (as before) and <math>u = -1.43</math> at <math>x^U = -1.823</math>. Now, <math>u - l = 0.11 < \epsilon</math>, so <math>x^U = -1.823</math> is accepted as the global optimum of the region analyzed. The best solution found thus far is now <math>x^* = -1.823</math> and <math>U = -1.42</math>. Since a global optimum exists, the region does not need to be branched.

[[File:sBBexample3.png|200px|thumb|center|The second iteration.1]]

===Third Iteration===

The region <math>4.46 \le x \le 6</math> is examined. The lower and upper bounds are computed to be <math>l = 1.11</math> at <math>x^L = 4.46</math> and <math>u = 1.147</math> at <math>x^U = 4.46</math>. Since <math>u - l = 0.04 < \epsilon</math>, the solution found is the optimum of the subregion, and branching is not necessary. The best solution found thus far does not need to be updated since <math>U < u</math>, so this region is discarded. The algorithm converges at the global optimum <math>x^* = -1.823</math>.

[[File:sBBexample4.png|200px|thumb|center|The third iteration.1]]

This example illustrates important features of the spatial branch-and-bound approach. First, the convex relaxation can be made tighter in each region. Second, in problems with multiple variables, the choice of the variable on which to branch is important. Third, the subregions are pruned when its lower bound is higher than the best solution found thus far.

==Applications==
Spatial branch-and-bound is used to solve a number of problems, including
*Pooling and Blending Problems: An example of a pooling/blending problem is the mixing of various types of crude oils with different qualities from different feeds. These streams are mixed in various pools. The parameters of the mixing are set to minimize cost while meeting quality requirement and quantity demand.
*Euclidean Location Problems: An example of this problem is finding the geographical positions of a plant that minimizes transportation costs.
*Kissing Number Problem: The objective of the kissing number problem is to find the maximum number of non-overlapping n-dimensional spheres in (n+1)-dimensional Euclidian space that can be arranged such that each is touching another. In 1-dimension, the kissing number is 2; in 2-dimension, 6; and in 3-dimension, 12.

==Conclusion==

Accurate models of real problems often involve nonconvex terms in either the objective function or the constraints. Nonconvex problems are one of the hardest to solve. Since there are multiple local optima solutions, it is difficult to find which solution is the best. Global optimization approaches, such as spatial branch-and-bound, are used to solve such problems. Other algorithms have been derived from the original spatial branch-and-bound. Examples include branch-and-reduce, α branch-and-bound, symbolic reformulation, branch-and-contract, and reduced spatial branch-and-bound.

==References==

#Liberti, Leo. "Introduction to global optimization." Lecture of Ecole Polytechnique, Palaiseau F 91128 (2008): 12.
#Pozo, C., et al. "A spatial branch-and-bound framework for the global optimization of kinetic models of metabolic networks." Industrial & Engineering Chemistry Research 50.9 (2010): 5225-5238.
#Stein, Oliver, Peter Kirst, and Paul Steuermann. "An Enhanced Spatial Branch-and-Bound Method in Global Optimization with Nonconvex Constraints." (2013).

File:Lse plot.png

2022-04-01T15:25:43Z

Btantisujjatham:

Lse plot

Logarithmic transformation

2022-04-01T15:24:58Z

Btantisujjatham: Created page with "Author: Hassan Ali (ChE 345 Spring 2015) Steward: Dajun Yue, Fengqi You '''Logarithmic transformation''' is a method used to change geometric programs into their convex form..."

Author: Hassan Ali (ChE 345 Spring 2015)

Steward: Dajun Yue, Fengqi You

'''Logarithmic transformation''' is a method used to change geometric programs into their convex forms. A '''geometric program''', or '''GP''', is a type of global optimization problem that concerns minimizing a subject to constraint functions so as to allow one to solve unique non-linear programming problems. All geometric programs contain functions called posynomials that are inherently non-convex. Due to this fact, solving geometric problems can be computationally intensive and finding a global optimum solution is not guaranteed. However, by creating a logarithmic transformation for a problem, one can solve for the globally optimum solution quicker and easier. A logarithmic transformation is not the only transformation which allows. One can also use an exponential transformation to obtain the same result. A logarithmic transformation can also be used on signomial programs, which are an extension of geometric programs.1,2,3

==Background==

===Geometric Programming===

Geometric programming was first discussed in the early 1960's as a class of optimization problem that could be solved with geometric inequalities. Duffin and his colleagues went on to formulate basic theories of geometric programming and its applications in 1967 in their seminal textbook ''Geometric Programming Theory and Application''.4 One of their insights into geometric programming was noticing that problems with highly non-linear constraints could be stated equivalently with a '''dual program'''. In other words, a global minimizing solution could be found by solving its corresponding dual maximization problem. This is because the dual constraints are linear. All one would need to do is change the geometric programming problem from its standard posynomial form into this "dual form" and solve using these now linear constraints.3 The answer could only be locally optimum as one was still dealing with '''non-linear programming (NLP)''' methods.

A standard GP problem takes the following form

:<math> \min ~f_0(x)</math>
:<math>s.t. ~~ f_i(x) \le 1, i = 1,...,m </math>
:::<math>h_j(x)=1, j=1,...,M</math>

where <math> f_0,...,f_m</math> are posynomials and <math> h_1,...,h_M</math> are monomials, and the variables '''x''' are all positive.

===Posynomials===

'''Posynomials''' are functions that are non-linear and consist of a set of constants multiplied to a series of multiple variables multiplied together. Often these variables are raised to various powers. More specifically, they are functions of the form:

<math> f(x_1,x_2,...,x_n)=\sum_{k=1}^K c_kx_1^{a_{1k}}...x_n^{a_{nk}} </math>

where the variables <math> x_i </math> and the coefficients <math> c_k </math> are positive, real numbers, and all of the exponents <math> a_{ik} </math> are real numbers.4

'''Monomials''' are a specific case of posynomial where there is no set of terms but rather only one term and its constant. A posynomial is then simply a sum of monomials. An example of a monomial is

<math> f(x_1,x_2)=400x_1^{3}x_2^{-2} </math>

and an example of a posynomial is

<math> f(x_1,x_2)=4x_1^{2}x_2^{4}+x_1^{3}x_2^{5} </math>

A posynomial is therefore not of the form

<math> f(x_1,x_2)=x_1-x_2 </math>

as this is a linear function.

===Geometric Programming Derivation===

GP problems are not always given and sometimes must be derived.5 Take the following example of a standard NLP problem

:<math> \max ~x/y</math>
:<math> s.t. ~~2\le x\le 3</math>
:::<math> ~x^2 + 3y/z\le \sqrt{y} </math>
:::<math> ~x/y = z^2 </math>

where the variables '''x,y, and z''' are positive. The standard GP form of this problem would be

:<math> \max ~x^{-y}</math>
:<math> s.t. ~~2x^-1\le 1, (1/3)x\le 1 </math>
:::<math> ~x^{2}y^{-0.5} + 3y^{0.5}z^{-1}\le 1 </math>
:::<math> ~xy^{-1}z^{-2} = 1 </math>

A standard GP problem then minimizes a posynomial function while constraining it to posynomial inequality constraints and monomial equality constraints. Posynomials are positive functions and contain log convexity. This is an important facet as it allows standard GP problems to undergo log transformations.1

==Logarithmic Transformation==

The Duffin "dual-program" method for solving GP problems is still in use but as logarithmic and exponential transformation methods were understood it became easier to simply use them to change the standard GP problem into a convex GP problem and solve using an interior point method. '''Interior point methods''' can solve GP problems very fast and robust as they require essentially no tuning of the parameters. Most importantly, the final solution obtained through this method is guaranteed to be the global optimum solution. Solving a standard GP problem is like solving an NLP problem. The only difference being that a GP problem is more constrained in its use of posynomials and monomials in its objective function and constraints. This makes standard GP problems more efficient to solve and they can be solved relatively quickly, but like NLP problems there is no guarantee the solution is globally optimum. In addition, an initial guess needs to be provided and parameters must be carefully chosen.5

Logarithmic transformations are therefore very helpful as they transform the standard GP form into its convex form based on a logarithmic change of variables and a logarithmic transformation of the objective and constraint functions. So instead of minimizing the objective <math>f_0</math>, the logarithm of <math>f_0</math> is minimized. The variable '''x''' is replaced with the its logarithm <math>y = \log x</math>. So now <math>x = e^y</math>. The inequality constraint is now <math>\log f_i \le 0</math> instead of <math>f_i \le 0</math> and the equality constraint is now <math>\log h_j=0</math> instead of <math>h_j=0</math>.5

Therefore the convex form of a GP problem is

:<math> \min ~\log f_0(e^y)</math>
:<math>s.t. ~~ \log f_i(e^y) \le 0, i = 1,...,m </math>
:::<math>\log h_j(e^y)=0, j=1,...,M</math>

where <math> f_0,...,f_m</math> are posynomials and <math> h_1,...,h_M</math> are monomials, and the variables '''y''' are all positive. This reformulation can occur as long as the constants in the function are positive.

===Methods===

[[File:Lse_plot.png|thumb|right|upright=1.7|The log-sum-exp function in <math>R^2</math>.]]

To see the transformation clearer, more '''detailed steps''' can be shown. If the objective function is a '''monomial''' of the format

<math>f_0(x) = cx_1^{a_1}x_2^{a_2}...x_n^{a_n}</math>

then the new objective function will be a function of a new variable '''y''' where <math>y = \log x</math> and <math>x = e^y</math>. It will also be a logarithm of the original objective function so

<math>f(y) = \log f_0(e^y)</math>

where

<math>\log f_0(e^y) = \log c +a_1\log x_1 +...+ a_n\log x_n = \log c +a_1y_1 +...+ a_ny_n</math>

which is an affine function of '''y'''

If the above objective function was instead a monomial equality constraint such that

<math>g(x) = cx_1^{a_1}x_2^{a_2}...x_n^{a_n}=1</math>

then the new equality constraint would be equal to zero and be simplified to a linear equation of the form

<math>-\log c = a_1y_1 +...+ a_ny_n</math>

Any GP with only monomials reduces to a linear program after a logarithmic transformation. Checking for linearity in a monomial case will confirm if the transformation was done correctly.5

If the GP has '''posynomials''' the problem becomes more complex. The log of multiple terms cannot be simplified beyond its logarithm form. So if the objective function is

<math>f_0(x) = \sum_{n=1}^N c_nx_n^{a_n}</math>

where '''c''' > 0, then the new objective function will be

<math>f(y) = \log f_0(e^y)</math>

where

<math>\log f_0(e^y) = \log \left ( \sum_{n=1}^N e^{a_ny_n+b_n} \right ) </math>

and where <math>b_n = \log c_n </math> for <math>n = 1,...,N</math>. The above can be written in a simpler form as

<math>\log f_0(e^y) = lse(Ay+b)</math>

where A is an N by n matrix with rows <math>a_1,...,a_N</math> and lse is the [https://inst.eecs.berkeley.edu/~ee127a/book/login/def_lse_fcn.html log-sum-exp function] which is convex.6

The reason the logarithmic transformation is done when an '''exponential transformation''' can be done in less steps is that the exponential function might take in large values. This may create numerical problems as well as lead to complications in optimization programs. Taking a logarithm allows for the recovery of smaller values, which gives the logarithmic transformation a distinct advantage.6

===Examples===
A few objective functions and constraints are provided in this section as illustrative examples.

'''EXAMPLE 1'''

:<math> \min ~f_0(x_1,x_2)=4x_1^{2}x_2^{4}</math>
:<math>s.t. ~~ x_1 \ge 0 </math>
:::<math> x_2 \ge 0 </math>

After a logarithmic transformation, this GP becomes:
:<math> \min ~f(y_1, y_2)= \log 4 + 2y_1 + 4y_2 </math>
:<math>s.t. ~~ y_1 \ge 0 </math>
:::<math> y_2 \ge 0 </math>

'''EXAMPLE 2'''

:<math> \min ~f_0(x_1,x_2)=x_1^{2}x_2 + x_1x_2</math>
:<math>s.t. ~~ x_1^{2}x_2^{-2} \ge 1 </math>

After a logarithmic transformation, this GP becomes:
:<math> \min ~f(y_1, y_2)= \log \left ( \sum_{n=1}^2 e^{a_ny_n} \right ) </math>
:<math>s.t. ~~ 2y_1 - 2y_2 \ge 0 </math>

'''EXAMPLE 3'''

:<math> \min ~f_0(x_1,x_2)=16x_1^{2}x_2^{5}</math>
:<math>s.t. ~~ \frac{x_1}{x_2} = 1 </math>

After a logarithmic transformation, this GP becomes:
:<math> \min ~f(y_1, y_2)= \log 16 + 2y_1 + 5y_2 </math>
:<math>s.t. ~~ y_1=y_2 </math>

To prove that the answers to the above examples are indeed convex, one could verify its convexity through a positive-definiteness test of the [http://en.wikipedia.org/wiki/Hessian_matrix/ Hessian]. The reformulations can also be tested in GAMS.

==Feasibility Analysis==

A standard GP problem can be infeasible i.e. the constraints are too "tight" and don't allow for a solution. Because the constraints from a standard GP problem hold over even when changed to its convex form via logarithmic transformation, an infeasible GP problem will always be infeasible regardless of convexity. This is why it is important to check for infeasibility prior to creating a linear transformation. This is done by undergoing a feasibility analysis and setting up the GP as follows:

:<math> \min ~s </math>
:<math>s.t. ~~f_i(x) \le s, i = 1,...,m</math>
:::<math>g_i(x)=1, i=1,...,p</math>
:::<math> s \ge 1</math>

As s nears a value of 1, the original problem nears feasibility. The final goal is to find a solution such that s=1 so that feasibility of the problem is confirmed.1

==Applications==

Applications of geometric programming include: 

<OL>

<LI>'''Electrical Engineering'''5
<UL>
<LI>Power Control
<LI>Wire Sizing
<LI>Routing
<LI>Optimal doping profile in semiconductor device engineering
<LI>Digital circuit gate sizing
</UL>
<LI>'''Chemical Engineering'''5
<UL>
<LI>Optimal reactor design
<LI>Mass Transfer Optimization
<LI>Kinetics
<LI>Maximizing reliability of reactors
</UL>
<LI>Other
<UL>
<LI>Transportation6
<LI>Economic Models
<LI>Inventory Models
</UL>
</OL>

A great example among these varied applications concerns optimal reactor design. The chemical systems within a reactor follow non convex kinetics equations. If one wanted to optimize a reactor for a given system with given reaction rates, one could do so easily with a logarithmic transformation. Take a reaction '''A + B to C''' with reaction rate <math>r = kC_AC_B</math>. The logarithmic transformation that would allow one to obtain a convex relation is <math>r^*=\log k + c_A +c_B</math> where <math>c = \log C</math>. With this new equation, you can optimize your reactor design for your given system.

== Conclusion ==

Geometric Programming is an application that can solve a varying degree of difficult optimization problems but due to the nature of their complexity, these problems cannot be solved easily. Solving geometric problems can be computationally intensive and finding a global optimum solution is not guaranteed. This is due to the fact that GPs involve posynomials and monomials which are by nature non-linear. However, by creating a '''logarithmic transformation''' for a problem, one can solve for the globally optimum solution quicker and easier because the GP is changed to its convex form. Logarithmic transformations offer an advantage over other transformations as it allows one to work with smaller values which are less problematic. Logarithmic transformations can be used wherever geometric programming is used which includes applications in chemical engineering and economics.

== References ==

# Chiang, M. (2005). ''Geometric Programming for Communication Systems'', Publishers, Inc., ISBN 1-933019-09-3.
# Duffin, R.J. (1970). "Linearizing Geometric Programs", ''SIAM Review'' '''12''' (2).
# Biswal, K.K., Ohja, A.K. (2010). "Posynomial Geometric Programming Problems with Multiple Parameters", ''Journal of Computing'' '''2''' (1).
# Duffin, R.J., Peterson, E.L., Zener, C.M. (1967). ''Geometric Programming Theory and Application'', John Wiley & Sons Inc., ISBN 978-0471223702.
# Boyd, S., Hassibi, A., Kim, S.J., Vandenberghe, L. (2007). "A Tutorial on Geometric Programming", ''Optim Eng'' '''8''': 67-127.
# Calafiore, G.C., El Ghaoui, L. (2014). ''Optimization Models and Applications'', Cambridge University Press., ISBN 978-1107050877.

File:Example11.png

2022-04-01T15:24:16Z

Btantisujjatham:

Example11

File:Example6 so.png

2022-04-01T15:23:50Z

Btantisujjatham:

Example6 so

File:Example4 so.png

2022-04-01T15:23:41Z

Btantisujjatham:

Example4 so

Subgradient optimization

2022-04-01T15:23:24Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Subgradient_optimization

Author Name: Aaron Anderson (ChE 345 Spring 2015) 
Steward: Dajun Yue and Fengqi You

[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''0.]]
'''Subgradient Optimization''' (or '''Subgradient Method''') is an iterative algorithm for minimizing convex functions, used predominantly in Nondifferentiable optimization for functions that are convex but nondifferentiable. It is often slower than Newton's Method when applied to convex differentiable functions, but can be used on convex nondifferentiable functions where Newton's Method will not converge. It was first developed by Naum Z. Shor in the Soviet Union in the 1960's.

==Introduction==
The '''Subgradient''' (related to Subderivative and Subdifferential) of a function is a way of generalizing or approximating the derivative of a convex function at nondifferentiable points. The definition of a subgradient is as follows: <math>g</math> is a subgradient of <math>f</math> at <math>x</math> if, for all <math>y</math>, the following is true: 
[[File:Subgradient.png|200px|center]]
An example of the subgradient of a nondifferentiable convex function <math>f</math> can be seen below:
[[File:Subgradient2.png|600px|center]]
Where <math>g_1</math> is a subgradient at point <math>x_1</math> and <math>g_2</math> and <math>g_3</math> are subgradients at point <math>x_2</math>. Notice that when the function is differentiable, such as at point <math>x_1</math>, the subgradient, <math>g_1</math>, just becomes the gradient to the function. Other important factors of the subgradient to note are that the subgradient gives a linear global underestimator of <math>f</math> and if <math>f</math> is convex, then there is at least one subgradient at every point in its domain. The set of all subgradients at a certain point is called the subdifferential, and is written as <math>\partial f(x_0)</math> at point <math>x_0</math>.

==The Subgradient Method==
Suppose <math>f:\mathbb{R}^n \to \mathbb{R}</math> is a convex function with domain <math>\mathbb{R}^n</math>. To minimize <math>f</math> the subgradient method uses the iteration: 
[[File:Submethod1.png|center]]
Where <math>k</math> is the number of iterations, <math>x^{(k)}</math> is the <math>k</math>th iterate, <math>g^{(x)}</math> is ''any'' subgradient at <math>x^{(k)}</math>, and <math>\alpha_k</math><math>(> 0)</math> is the <math>k</math>th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. As explained above, when <math>f</math> is differentiable, <math>g^{(k)}</math> simply reduces to <math>\nabla</math><math>f(x^{(k)})</math>. It is also important to note that the subgradient method is not a descent method in that the new iterate is not always the best iterate. Thus we need some way to keep track of the best solution found so far, ''i.e.'' the one with the smallest function value. We can do this by, after each step, setting 
[[File:submethod2.png|200px|center]]
and setting <math>i_{\text{best}}^{(k)} = k</math> if <math>x^{(k)}</math> is the best (smallest) point found so far. Thus we have:
[[File:submethod3.png|237px|center]]
which gives the best objective value found in <math>k</math> iterations. Since this value is decreasing, it has a limit (which can be <math>-\infty</math>). 
 
An algorithm flowchart is provided below for the subgradient method: 
[[File:SMFlowsheet.png|400px|center]]

===Step size===
Several different step size rules can be used:
*'''Constant step size''': <math>\alpha_k = h</math> independent of <math>k</math>.
*'''Constant step length''': [[File:stepsize1.png]] This means that [[File:stepsize2.png]]
*'''Square summable but not summable''': These step sizes satisfy
:[[File:stepsize3.png]]
:One typical example is [[File:stepsize4.png]] where <math>a>0</math> and <math>b\ge0</math>.
*'''Nonsummable diminishing''': These step sizes satisfy
:[[File:stepsize5.png]]
:One typical example is [[File:stepsize6.png]] where <math>a>0</math>.
An important thing to note is that for all four of the rules given here, the step sizes are determined "off-line", or before the method is iterated. Thus the step sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step size rules used for descent methods for differentiable functions where the step sizes do depend on preceding iterations.

===Convergence Results===
There are different results on convergence for the subgradient method depending on the different step size rules applied.
For constant step size rules and constant step length rules the subgradient method is guaranteed to converge within some range of the optimal value. Thus:
[[File:convergence1.png|center]]
where <math>f^{*}</math> is the optimal solution to the problem and <math>\epsilon</math> is the aforementioned range of convergence. This means that the subgradient method finds a point within <math>\epsilon</math> of the optimal solution <math>f^{*}</math>. <math>\epsilon</math> is number that is a function of the step size parameter <math>h</math>, and as <math>h</math> decreases the range of convergence <math>\epsilon</math> also decreases, ''i.e.'' the solution of the subgradient method gets closer to <math>f^{*}</math> with a smaller step size parameter <math>h</math>.
For the diminishing step size rule and the square summable but not summable rule, the algorithm is guaranteed to converge to the optimal value or [[File:convergence2.png]] When the function <math>f</math> is differentiable the subgradient method with constant step size yields convergence to the optimal value, provided the parameter <math>h</math> is small enough.

==Example: Piecewise linear minimization==
Suppose we wanted to minimize the following piecewise linear convex function using the subgradient method: 
[[File:Example1.png|center]]
Since this is a linear programming problem finding a subgradient is simple: given <math>x</math> we can find an index <math>j</math> for which:
[[File:Example2_so.png|center]]
The subgradient in this case is <math>g=a_j</math>. Thus the iterative update is then:
[[File:Example3_so.png|center]]
Where <math>j</math> is chosen such to satisfy [[File:Example4_so.png]]
In order to apply the subgradient method to this problem all that is needed is some way to calculate [[File:Example5.png]] and the ability to carry out the iterative update. Even if the problem is dense and very large (where standard linear programming might fail), if there is some efficient way to calculate <math>f</math> then the subgradient method is a reasonable choice for algorithm.
Consider a problem with <math>n=10</math> variables and <math>m=100</math> terms and with data <math>a_i</math> and <math>b_i</math> generated from a normal distribution. We will consider all four of the step size rules mentioned above and will plot <math>\epsilon</math> or the difference between the optimal solution and the subgradient solution as a function of <math>k</math>, the nuber of iterations. 
For the constant step size rule [[File:Example6_so.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example7.png]] 
For the constant step length rule [[File:Example8.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example9.png]] 
The above figures reveal a trade-off: a larger step size parameter <math>h</math> gives a faster convergence but in the end gives a larger range of suboptimality so it is important to determine an <math>h</math> that will converge close to the optimal solution without taking a very large number of iterations. 
For the subgradient method using diminishing step size rules, both the nonsummable diminishing step size rule [[File:Example10.png]] (blue) and the square summable but not summable step size rule [[File:Example11.png]] (red) are plotted below for convergence: 
[[File:Example12.png]] 
This figure illustrates that both the nonsummable diminishing step size rule and the square summable but not summable step size rule show relatively fast and good convergence. The square summable but not summable step size rule shows less variation than the nonsummable diminishing step size rule but both show similar speed and convergence. 
Overall, all four step size rules can be used to get good convergence, so it is important to try different values for <math>h</math> in the constant step size and length rules and different formulas for the nonsummable diminishing step size rule and the square summable but not summable step size rule in order to get good convergence in the smallest amount of iterations possible.

==Conclusion==
The subgradient method is a very simple algorithm for minimizing convex nondifferentiable functions where newton's method and simple linear programming will not work. While the subgradient method has a disadvantage in that it can be much slower than interior-point methods such as Newton's method, it as the advantage of the memory requirement being often times much smaller than those of an interior-point or Newton method, which means it can be used for extremely large problems for which interior-point or Newton methods cannot be used. Morever, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem. The subgradient method is therefor an important method to know about for solving convex minimization problems that are nondifferentiable or very large.

==References==
1. Akgul, M. "Topics in Relaxation and Ellipsoidal Methods", volume 97 of Research Notes in Mathematics. Pitman, 1984. 
2. Bazaraa, M. S., Sherali, H. D. "On the choice of step size in subgradient optimization." European Journal of Operational Research 7.4, 1981 
3. Bertsekas, D. P. "Nonlinear Programming", (2nd edition), Athena Scientific, Belmont, MA, 1999. 
4. Goffin, J. L. "On convergence rates of subgradient optimization methods." Mathematical Programming 13.1, 1977. 
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985. 
6. Shor, N. Z. "Nondifferentiable Optimization and Polynomial Problems". Nonconvex Optimization and its Applications. Kluwer, 1998.

File:Convergence2.png

2022-04-01T15:22:39Z

Btantisujjatham:

Convergence2

File:Example8.png

2022-04-01T15:20:21Z

Btantisujjatham:

Example8

File:Example5.png

2022-04-01T15:19:06Z

Btantisujjatham:

Example5

File:Example10.png

2022-04-01T15:18:45Z

Btantisujjatham:

Example10

File:Submethod3.png

2022-04-01T15:17:43Z

Btantisujjatham:

Submethod3

File:Submethod2.png

2022-04-01T15:17:35Z

Btantisujjatham:

Submethod2

Subgradient optimization

2022-04-01T15:15:20Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Subgradient_optimization

Author Name: Aaron Anderson (ChE 345 Spring 2015) 
Steward: Dajun Yue and Fengqi You

[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''0.]]
'''Subgradient Optimization''' (or '''Subgradient Method''') is an iterative algorithm for minimizing convex functions, used predominantly in Nondifferentiable optimization for functions that are convex but nondifferentiable. It is often slower than Newton's Method when applied to convex differentiable functions, but can be used on convex nondifferentiable functions where Newton's Method will not converge. It was first developed by Naum Z. Shor in the Soviet Union in the 1960's.

==Introduction==
The '''Subgradient''' (related to Subderivative and Subdifferential) of a function is a way of generalizing or approximating the derivative of a convex function at nondifferentiable points. The definition of a subgradient is as follows: <math>g</math> is a subgradient of <math>f</math> at <math>x</math> if, for all <math>y</math>, the following is true: 
[[File:Subgradient.png|200px|center]]
An example of the subgradient of a nondifferentiable convex function <math>f</math> can be seen below:
[[File:Subgradient2.png|600px|center]]
Where <math>g_1</math> is a subgradient at point <math>x_1</math> and <math>g_2</math> and <math>g_3</math> are subgradients at point <math>x_2</math>. Notice that when the function is differentiable, such as at point <math>x_1</math>, the subgradient, <math>g_1</math>, just becomes the gradient to the function. Other important factors of the subgradient to note are that the subgradient gives a linear global underestimator of <math>f</math> and if <math>f</math> is convex, then there is at least one subgradient at every point in its domain. The set of all subgradients at a certain point is called the subdifferential, and is written as <math>\partial f(x_0)</math> at point <math>x_0</math>.

==The Subgradient Method==
Suppose <math>f:\mathbb{R}^n \to \mathbb{R}</math> is a convex function with domain <math>\mathbb{R}^n</math>. To minimize <math>f</math> the subgradient method uses the iteration: 
[[File:Submethod1.png|center]]
Where <math>k</math> is the number of iterations, <math>x^{(k)}</math> is the <math>k</math>th iterate, <math>g^{(x)}</math> is ''any'' subgradient at <math>x^{(k)}</math>, and <math>\alpha_k</math><math>(> 0)</math> is the <math>k</math>th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. As explained above, when <math>f</math> is differentiable, <math>g^{(k)}</math> simply reduces to <math>\nabla</math><math>f(x^{(k)})</math>. It is also important to note that the subgradient method is not a descent method in that the new iterate is not always the best iterate. Thus we need some way to keep track of the best solution found so far, ''i.e.'' the one with the smallest function value. We can do this by, after each step, setting 
[[File:submethod2.png|200px|center]]
and setting <math>i_{\text{best}}^{(k)} = k</math> if <math>x^{(k)}</math> is the best (smallest) point found so far. Thus we have:
[[File:submethod3.png|237px|center]]
which gives the best objective value found in <math>k</math> iterations. Since this value is decreasing, it has a limit (which can be <math>-\infty</math>). 
 
An algorithm flowchart is provided below for the subgradient method: 
[[File:SMFlowsheet.png|400px|center]]

===Step size===
Several different step size rules can be used:
*'''Constant step size''': <math>\alpha_k = h</math> independent of <math>k</math>.
*'''Constant step length''': [[File:stepsize1.png]] This means that [[File:stepsize2.png]]
*'''Square summable but not summable''': These step sizes satisfy
:[[File:stepsize3.png]]
:One typical example is [[File:stepsize4.png]] where <math>a>0</math> and <math>b\ge0</math>.
*'''Nonsummable diminishing''': These step sizes satisfy
:[[File:stepsize5.png]]
:One typical example is [[File:stepsize6.png]] where <math>a>0</math>.
An important thing to note is that for all four of the rules given here, the step sizes are determined "off-line", or before the method is iterated. Thus the step sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step size rules used for descent methods for differentiable functions where the step sizes do depend on preceding iterations.

===Convergence Results===
There are different results on convergence for the subgradient method depending on the different step size rules applied.
For constant step size rules and constant step length rules the subgradient method is guaranteed to converge within some range of the optimal value. Thus:
[[File:convergence1.png|center]]
where <math>f^{*}</math> is the optimal solution to the problem and <math>\epsilon</math> is the aforementioned range of convergence. This means that the subgradient method finds a point within <math>\epsilon</math> of the optimal solution <math>f^{*}</math>. <math>\epsilon</math> is number that is a function of the step size parameter <math>h</math>, and as <math>h</math> decreases the range of convergence <math>\epsilon</math> also decreases, ''i.e.'' the solution of the subgradient method gets closer to <math>f^{*}</math> with a smaller step size parameter <math>h</math>.
For the diminishing step size rule and the square summable but not summable rule, the algorithm is guaranteed to converge to the optimal value or [[File:convergence2.png]] When the function <math>f</math> is differentiable the subgradient method with constant step size yields convergence to the optimal value, provided the parameter <math>h</math> is small enough.

==Example: Piecewise linear minimization==
Suppose we wanted to minimize the following piecewise linear convex function using the subgradient method: 
[[File:Example1.png|center]]
Since this is a linear programming problem finding a subgradient is simple: given <math>x</math> we can find an index <math>j</math> for which:
[[File:Example2_so.png|center]]
The subgradient in this case is <math>g=a_j</math>. Thus the iterative update is then:
[[File:Example3_so.png|center]]
Where <math>j</math> is chosen such to satisfy [[File:Example4_so.png]]
In order to apply the subgradient method to this problem all that is needed is some way to calculate [[File:Example5.png]] and the ability to carry out the iterative update. Even if the problem is dense and very large (where standard linear programming might fail), if there is some efficient way to calculate <math>f</math> then the subgradient method is a reasonable choice for algorithm.
Consider a problem with <math>n=10</math> variables and <math>m=100</math> terms and with data <math>a_i</math> and <math>b_i</math> generated from a normal distribution. We will consider all four of the step size rules mentioned above and will plot <math>\epsilon</math> or the difference between the optimal solution and the subgradient solution as a function of <math>k</math>, the nuber of iterations. 
For the constant step size rule [[File:Example6.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example7.png]] 
For the constant step length rule [[File:Example8.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example9.png]] 
The above figures reveal a trade-off: a larger step size parameter <math>h</math> gives a faster convergence but in the end gives a larger range of suboptimality so it is important to determine an <math>h</math> that will converge close to the optimal solution without taking a very large number of iterations. 
For the subgradient method using diminishing step size rules, both the nonsummable diminishing step size rule [[File:Example10.png]] (blue) and the square summable but not summable step size rule [[File:Example11.png]] (red) are plotted below for convergence: 
[[File:Example12.png]] 
This figure illustrates that both the nonsummable diminishing step size rule and the square summable but not summable step size rule show relatively fast and good convergence. The square summable but not summable step size rule shows less variation than the nonsummable diminishing step size rule but both show similar speed and convergence. 
Overall, all four step size rules can be used to get good convergence, so it is important to try different values for <math>h</math> in the constant step size and length rules and different formulas for the nonsummable diminishing step size rule and the square summable but not summable step size rule in order to get good convergence in the smallest amount of iterations possible.

==Conclusion==
The subgradient method is a very simple algorithm for minimizing convex nondifferentiable functions where newton's method and simple linear programming will not work. While the subgradient method has a disadvantage in that it can be much slower than interior-point methods such as Newton's method, it as the advantage of the memory requirement being often times much smaller than those of an interior-point or Newton method, which means it can be used for extremely large problems for which interior-point or Newton methods cannot be used. Morever, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem. The subgradient method is therefor an important method to know about for solving convex minimization problems that are nondifferentiable or very large.

==References==
1. Akgul, M. "Topics in Relaxation and Ellipsoidal Methods", volume 97 of Research Notes in Mathematics. Pitman, 1984. 
2. Bazaraa, M. S., Sherali, H. D. "On the choice of step size in subgradient optimization." European Journal of Operational Research 7.4, 1981 
3. Bertsekas, D. P. "Nonlinear Programming", (2nd edition), Athena Scientific, Belmont, MA, 1999. 
4. Goffin, J. L. "On convergence rates of subgradient optimization methods." Mathematical Programming 13.1, 1977. 
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985. 
6. Shor, N. Z. "Nondifferentiable Optimization and Polynomial Problems". Nonconvex Optimization and its Applications. Kluwer, 1998.

Subgradient optimization

2022-04-01T15:15:02Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Subgradient_optimization

Author Name: Aaron Anderson (ChE 345 Spring 2015) 
Steward: Dajun Yue and Fengqi You

[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''0.]]
'''Subgradient Optimization''' (or '''Subgradient Method''') is an iterative algorithm for minimizing convex functions, used predominantly in Nondifferentiable optimization for functions that are convex but nondifferentiable. It is often slower than Newton's Method when applied to convex differentiable functions, but can be used on convex nondifferentiable functions where Newton's Method will not converge. It was first developed by Naum Z. Shor in the Soviet Union in the 1960's.

==Introduction==
The '''Subgradient''' (related to Subderivative and Subdifferential) of a function is a way of generalizing or approximating the derivative of a convex function at nondifferentiable points. The definition of a subgradient is as follows: <math>g</math> is a subgradient of <math>f</math> at <math>x</math> if, for all <math>y</math>, the following is true: 
[[File:Subgradient.png|200px|center]]
An example of the subgradient of a nondifferentiable convex function <math>f</math> can be seen below:
[[File:Subgradient2.png|600px|center]]
Where <math>g_1</math> is a subgradient at point <math>x_1</math> and <math>g_2</math> and <math>g_3</math> are subgradients at point <math>x_2</math>. Notice that when the function is differentiable, such as at point <math>x_1</math>, the subgradient, <math>g_1</math>, just becomes the gradient to the function. Other important factors of the subgradient to note are that the subgradient gives a linear global underestimator of <math>f</math> and if <math>f</math> is convex, then there is at least one subgradient at every point in its domain. The set of all subgradients at a certain point is called the subdifferential, and is written as <math>\partial f(x_0)</math> at point <math>x_0</math>.

==The Subgradient Method==
Suppose <math>f:\mathbb{R}^n \to \mathbb{R}</math> is a convex function with domain <math>\mathbb{R}^n</math>. To minimize <math>f</math> the subgradient method uses the iteration: 
[[File:Submethod1.png|center]]
Where <math>k</math> is the number of iterations, <math>x^{(k)}</math> is the <math>k</math>th iterate, <math>g^{(x)}</math> is ''any'' subgradient at <math>x^{(k)}</math>, and <math>\alpha_k</math><math>(> 0)</math> is the <math>k</math>th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. As explained above, when <math>f</math> is differentiable, <math>g^{(k)}</math> simply reduces to <math>\nabla</math><math>f(x^{(k)})</math>. It is also important to note that the subgradient method is not a descent method in that the new iterate is not always the best iterate. Thus we need some way to keep track of the best solution found so far, ''i.e.'' the one with the smallest function value. We can do this by, after each step, setting 
[[File:submethod2.png|200px|center]]
and setting <math>i_{\text{best}}^{(k)} = k</math> if <math>x^{(k)}</math> is the best (smallest) point found so far. Thus we have:
[[File:submethod3.png|237px|center]]
which gives the best objective value found in <math>k</math> iterations. Since this value is decreasing, it has a limit (which can be <math>-\infty</math>). 
 
An algorithm flowchart is provided below for the subgradient method: 
[[File:SMFlowsheet.png|400px|center]]

===Step size===
Several different step size rules can be used:
*'''Constant step size''': <math>\alpha_k = h</math> independent of <math>k</math>.
*'''Constant step length''': [[File:stepsize1.png]] This means that [[File:stepsize2.png]]
*'''Square summable but not summable''': These step sizes satisfy
:[[File:stepsize3.png]]
:One typical example is [[File:stepsize4.png]] where <math>a>0</math> and <math>b\ge0</math>.
*'''Nonsummable diminishing''': These step sizes satisfy
:[[File:stepsize5.png]]
:One typical example is [[File:stepsize6.png]] where <math>a>0</math>.
An important thing to note is that for all four of the rules given here, the step sizes are determined "off-line", or before the method is iterated. Thus the step sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step size rules used for descent methods for differentiable functions where the step sizes do depend on preceding iterations.

===Convergence Results===
There are different results on convergence for the subgradient method depending on the different step size rules applied.
For constant step size rules and constant step length rules the subgradient method is guaranteed to converge within some range of the optimal value. Thus:
[[File:convergence1.png|center]]
where <math>f^{*}</math> is the optimal solution to the problem and <math>\epsilon</math> is the aforementioned range of convergence. This means that the subgradient method finds a point within <math>\epsilon</math> of the optimal solution <math>f^{*}</math>. <math>\epsilon</math> is number that is a function of the step size parameter <math>h</math>, and as <math>h</math> decreases the range of convergence <math>\epsilon</math> also decreases, ''i.e.'' the solution of the subgradient method gets closer to <math>f^{*}</math> with a smaller step size parameter <math>h</math>.
For the diminishing step size rule and the square summable but not summable rule, the algorithm is guaranteed to converge to the optimal value or [[File:convergence2.png]] When the function <math>f</math> is differentiable the subgradient method with constant step size yields convergence to the optimal value, provided the parameter <math>h</math> is small enough.

==Example: Piecewise linear minimization==
Suppose we wanted to minimize the following piecewise linear convex function using the subgradient method: 
[[File:Example1_so.png|center]]
Since this is a linear programming problem finding a subgradient is simple: given <math>x</math> we can find an index <math>j</math> for which:
[[File:Example2_so.png|center]]
The subgradient in this case is <math>g=a_j</math>. Thus the iterative update is then:
[[File:Example3.png|center]]
Where <math>j</math> is chosen such to satisfy [[File:Example4.png]]
In order to apply the subgradient method to this problem all that is needed is some way to calculate [[File:Example5.png]] and the ability to carry out the iterative update. Even if the problem is dense and very large (where standard linear programming might fail), if there is some efficient way to calculate <math>f</math> then the subgradient method is a reasonable choice for algorithm.
Consider a problem with <math>n=10</math> variables and <math>m=100</math> terms and with data <math>a_i</math> and <math>b_i</math> generated from a normal distribution. We will consider all four of the step size rules mentioned above and will plot <math>\epsilon</math> or the difference between the optimal solution and the subgradient solution as a function of <math>k</math>, the nuber of iterations. 
For the constant step size rule [[File:Example6.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example7.png]] 
For the constant step length rule [[File:Example8.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example9.png]] 
The above figures reveal a trade-off: a larger step size parameter <math>h</math> gives a faster convergence but in the end gives a larger range of suboptimality so it is important to determine an <math>h</math> that will converge close to the optimal solution without taking a very large number of iterations. 
For the subgradient method using diminishing step size rules, both the nonsummable diminishing step size rule [[File:Example10.png]] (blue) and the square summable but not summable step size rule [[File:Example11.png]] (red) are plotted below for convergence: 
[[File:Example12.png]] 
This figure illustrates that both the nonsummable diminishing step size rule and the square summable but not summable step size rule show relatively fast and good convergence. The square summable but not summable step size rule shows less variation than the nonsummable diminishing step size rule but both show similar speed and convergence. 
Overall, all four step size rules can be used to get good convergence, so it is important to try different values for <math>h</math> in the constant step size and length rules and different formulas for the nonsummable diminishing step size rule and the square summable but not summable step size rule in order to get good convergence in the smallest amount of iterations possible.

==Conclusion==
The subgradient method is a very simple algorithm for minimizing convex nondifferentiable functions where newton's method and simple linear programming will not work. While the subgradient method has a disadvantage in that it can be much slower than interior-point methods such as Newton's method, it as the advantage of the memory requirement being often times much smaller than those of an interior-point or Newton method, which means it can be used for extremely large problems for which interior-point or Newton methods cannot be used. Morever, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem. The subgradient method is therefor an important method to know about for solving convex minimization problems that are nondifferentiable or very large.

==References==
1. Akgul, M. "Topics in Relaxation and Ellipsoidal Methods", volume 97 of Research Notes in Mathematics. Pitman, 1984. 
2. Bazaraa, M. S., Sherali, H. D. "On the choice of step size in subgradient optimization." European Journal of Operational Research 7.4, 1981 
3. Bertsekas, D. P. "Nonlinear Programming", (2nd edition), Athena Scientific, Belmont, MA, 1999. 
4. Goffin, J. L. "On convergence rates of subgradient optimization methods." Mathematical Programming 13.1, 1977. 
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985. 
6. Shor, N. Z. "Nondifferentiable Optimization and Polynomial Problems". Nonconvex Optimization and its Applications. Kluwer, 1998.

Subgradient optimization

2022-04-01T15:14:45Z

Btantisujjatham:

This web page is a duplicate of https://optimization.mccormick.northwestern.edu/index.php/Subgradient_optimization

Author Name: Aaron Anderson (ChE 345 Spring 2015) 
Steward: Dajun Yue and Fengqi You

[[File:Subderivative_illustration.png|right|thumb|A convex nondifferentiable function (blue) with red "subtangent" lines generalizing the derivative at the nondifferentiable point ''x''0.]]
'''Subgradient Optimization''' (or '''Subgradient Method''') is an iterative algorithm for minimizing convex functions, used predominantly in Nondifferentiable optimization for functions that are convex but nondifferentiable. It is often slower than Newton's Method when applied to convex differentiable functions, but can be used on convex nondifferentiable functions where Newton's Method will not converge. It was first developed by Naum Z. Shor in the Soviet Union in the 1960's.

==Introduction==
The '''Subgradient''' (related to Subderivative and Subdifferential) of a function is a way of generalizing or approximating the derivative of a convex function at nondifferentiable points. The definition of a subgradient is as follows: <math>g</math> is a subgradient of <math>f</math> at <math>x</math> if, for all <math>y</math>, the following is true: 
[[File:Subgradient.png|200px|center]]
An example of the subgradient of a nondifferentiable convex function <math>f</math> can be seen below:
[[File:Subgradient2.png|600px|center]]
Where <math>g_1</math> is a subgradient at point <math>x_1</math> and <math>g_2</math> and <math>g_3</math> are subgradients at point <math>x_2</math>. Notice that when the function is differentiable, such as at point <math>x_1</math>, the subgradient, <math>g_1</math>, just becomes the gradient to the function. Other important factors of the subgradient to note are that the subgradient gives a linear global underestimator of <math>f</math> and if <math>f</math> is convex, then there is at least one subgradient at every point in its domain. The set of all subgradients at a certain point is called the subdifferential, and is written as <math>\partial f(x_0)</math> at point <math>x_0</math>.

==The Subgradient Method==
Suppose <math>f:\mathbb{R}^n \to \mathbb{R}</math> is a convex function with domain <math>\mathbb{R}^n</math>. To minimize <math>f</math> the subgradient method uses the iteration: 
[[File:Submethod1.png|center]]
Where <math>k</math> is the number of iterations, <math>x^{(k)}</math> is the <math>k</math>th iterate, <math>g^{(x)}</math> is ''any'' subgradient at <math>x^{(k)}</math>, and <math>\alpha_k</math><math>(> 0)</math> is the <math>k</math>th step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. As explained above, when <math>f</math> is differentiable, <math>g^{(k)}</math> simply reduces to <math>\nabla</math><math>f(x^{(k)})</math>. It is also important to note that the subgradient method is not a descent method in that the new iterate is not always the best iterate. Thus we need some way to keep track of the best solution found so far, ''i.e.'' the one with the smallest function value. We can do this by, after each step, setting 
[[File:submethod2.png|200px|center]]
and setting <math>i_{\text{best}}^{(k)} = k</math> if <math>x^{(k)}</math> is the best (smallest) point found so far. Thus we have:
[[File:submethod3.png|237px|center]]
which gives the best objective value found in <math>k</math> iterations. Since this value is decreasing, it has a limit (which can be <math>-\infty</math>). 
 
An algorithm flowchart is provided below for the subgradient method: 
[[File:SMFlowsheet.png|400px|center]]

===Step size===
Several different step size rules can be used:
*'''Constant step size''': <math>\alpha_k = h</math> independent of <math>k</math>.
*'''Constant step length''': [[File:stepsize1.png]] This means that [[File:stepsize2.png]]
*'''Square summable but not summable''': These step sizes satisfy
:[[File:stepsize3.png]]
:One typical example is [[File:stepsize4.png]] where <math>a>0</math> and <math>b\ge0</math>.
*'''Nonsummable diminishing''': These step sizes satisfy
:[[File:stepsize5.png]]
:One typical example is [[File:stepsize6.png]] where <math>a>0</math>.
An important thing to note is that for all four of the rules given here, the step sizes are determined "off-line", or before the method is iterated. Thus the step sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step size rules used for descent methods for differentiable functions where the step sizes do depend on preceding iterations.

===Convergence Results===
There are different results on convergence for the subgradient method depending on the different step size rules applied.
For constant step size rules and constant step length rules the subgradient method is guaranteed to converge within some range of the optimal value. Thus:
[[File:convergence1.png|center]]
where <math>f^{*}</math> is the optimal solution to the problem and <math>\epsilon</math> is the aforementioned range of convergence. This means that the subgradient method finds a point within <math>\epsilon</math> of the optimal solution <math>f^{*}</math>. <math>\epsilon</math> is number that is a function of the step size parameter <math>h</math>, and as <math>h</math> decreases the range of convergence <math>\epsilon</math> also decreases, ''i.e.'' the solution of the subgradient method gets closer to <math>f^{*}</math> with a smaller step size parameter <math>h</math>.
For the diminishing step size rule and the square summable but not summable rule, the algorithm is guaranteed to converge to the optimal value or [[File:convergence2.png]] When the function <math>f</math> is differentiable the subgradient method with constant step size yields convergence to the optimal value, provided the parameter <math>h</math> is small enough.

==Example: Piecewise linear minimization==
Suppose we wanted to minimize the following piecewise linear convex function using the subgradient method: 
[[File:Example1.png|center]]
Since this is a linear programming problem finding a subgradient is simple: given <math>x</math> we can find an index <math>j</math> for which:
[[File:Example2_so.png|center]]
The subgradient in this case is <math>g=a_j</math>. Thus the iterative update is then:
[[File:Example3_so.png|center]]
Where <math>j</math> is chosen such to satisfy [[File:Example4.png]]
In order to apply the subgradient method to this problem all that is needed is some way to calculate [[File:Example5.png]] and the ability to carry out the iterative update. Even if the problem is dense and very large (where standard linear programming might fail), if there is some efficient way to calculate <math>f</math> then the subgradient method is a reasonable choice for algorithm.
Consider a problem with <math>n=10</math> variables and <math>m=100</math> terms and with data <math>a_i</math> and <math>b_i</math> generated from a normal distribution. We will consider all four of the step size rules mentioned above and will plot <math>\epsilon</math> or the difference between the optimal solution and the subgradient solution as a function of <math>k</math>, the nuber of iterations. 
For the constant step size rule [[File:Example6.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example7.png]] 
For the constant step length rule [[File:Example8.png]] for several values of <math>h</math> the following plot was obtained: 
[[File:Example9.png]] 
The above figures reveal a trade-off: a larger step size parameter <math>h</math> gives a faster convergence but in the end gives a larger range of suboptimality so it is important to determine an <math>h</math> that will converge close to the optimal solution without taking a very large number of iterations. 
For the subgradient method using diminishing step size rules, both the nonsummable diminishing step size rule [[File:Example10.png]] (blue) and the square summable but not summable step size rule [[File:Example11.png]] (red) are plotted below for convergence: 
[[File:Example12.png]] 
This figure illustrates that both the nonsummable diminishing step size rule and the square summable but not summable step size rule show relatively fast and good convergence. The square summable but not summable step size rule shows less variation than the nonsummable diminishing step size rule but both show similar speed and convergence. 
Overall, all four step size rules can be used to get good convergence, so it is important to try different values for <math>h</math> in the constant step size and length rules and different formulas for the nonsummable diminishing step size rule and the square summable but not summable step size rule in order to get good convergence in the smallest amount of iterations possible.

==Conclusion==
The subgradient method is a very simple algorithm for minimizing convex nondifferentiable functions where newton's method and simple linear programming will not work. While the subgradient method has a disadvantage in that it can be much slower than interior-point methods such as Newton's method, it as the advantage of the memory requirement being often times much smaller than those of an interior-point or Newton method, which means it can be used for extremely large problems for which interior-point or Newton methods cannot be used. Morever, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem. The subgradient method is therefor an important method to know about for solving convex minimization problems that are nondifferentiable or very large.

==References==
1. Akgul, M. "Topics in Relaxation and Ellipsoidal Methods", volume 97 of Research Notes in Mathematics. Pitman, 1984. 
2. Bazaraa, M. S., Sherali, H. D. "On the choice of step size in subgradient optimization." European Journal of Operational Research 7.4, 1981 
3. Bertsekas, D. P. "Nonlinear Programming", (2nd edition), Athena Scientific, Belmont, MA, 1999. 
4. Goffin, J. L. "On convergence rates of subgradient optimization methods." Mathematical Programming 13.1, 1977. 
5. Shor, N. Z. "Minimization Methods for Non-differentiable Functions". Springer Series in Computational Mathematics. Springer, 1985. 
6. Shor, N. Z. "Nondifferentiable Optimization and Polynomial Problems". Nonconvex Optimization and its Applications. Kluwer, 1998.

File:Subgradient.png

2022-04-01T15:14:12Z

Btantisujjatham:

Subgradient

File:Stepsize6.png

2022-04-01T15:13:57Z

Btantisujjatham:

Stepsize6

File:Stepsize5.png

2022-04-01T15:13:49Z

Btantisujjatham:

Stepsize5

File:Stepsize4.png

2022-04-01T15:13:30Z

Btantisujjatham:

Stepsize4

File:Stepsize3.png

2022-04-01T15:13:12Z

Btantisujjatham:

Stepsize3

File:Stepsize2.png

2022-04-01T15:13:05Z

Btantisujjatham:

Stepsize2

File:Stepsize1.png

2022-04-01T15:12:56Z

Btantisujjatham:

Stepsize1

File:Example3 so.png

2022-04-01T15:12:37Z

Btantisujjatham:

Example3 so

File:Example2 so.png

2022-04-01T15:12:28Z

Btantisujjatham:

Example2 so

File:Convergence1.png

2022-04-01T15:11:35Z

Btantisujjatham:

Convergence1

File:Example1.png

2022-04-01T15:11:24Z

Btantisujjatham:

Example1

File:Submethod1.png

2022-04-01T15:08:23Z

Btantisujjatham:

Submethod1