Line search methods

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Jump to navigation Jump to search

Authors: Lihe Cao, Zhengyi Sui, Jiaqi Zhang, Yuqing Yan, and Yuhui Gu (6800 Fall 2021).

Introduction

Line search method is an iterative approach to find a local minimum of a multidimensional nonlinear function using the function's gradients. It aims at finding an acceptable step length witch satisfies certain standard conditions. Line search and trust-region method are two fundamental strategies for locating the new iterate given the current point. [1] With the ability to solve the unconstrained optimization problem, line search is widely used in many cases including machine learning, game theory and other fields.

Generic Line Search Method

Basic Algorithm

  • Pick the starting point
  • Repeat the following steps until coverges to a local minimum :
    • Choose a descent direction starting at , defined as: for
    • Find a step length so that
    • Set

Search Direction for Line Search

The direction of the line search should be chosen to make decrease moving from point to , and it is usually related to the gradient . The most obvious direction is the because it is the one to make decreases most rapidly. This claim can be verified by Taylor's theorem:

where

The rate of change in along the direction at is the coefficient of . Therefore, the unit direction of most rapid decrease is the solution to

.

is the solution and this direction is orthogonal to the contours of the function. In the following sections, this will be used as the default direction of the line search.

However, the steepest descent direction is not the most effient, as the steepest descent method does not pass the Rosenbrock test (see Figure 1). [2] Carefully desgined descent directions deviating from the steepest direction can be used in practice to produce faster convergence. [3]

Step Length

The step length is a non-negative value such that . When choosing the step length , there is a trade off between giving a substantial reduction of and not spending too much time finding the solution. If is too large, then the step will overshoot, while if the step length is too small, it is time consuming to find the convergent point. An exact line search and inexact line search are needed to find the value of and more detail about these approaches will be introduced in the next section.

Convergence

For a line search algorithm to be reliable, it should be globally convergent, that is the gradient norms, , should converge to zero with each iteration, i.e., .

It can be shown from Zoutendijk's theorem [1] that if the line search algorithm satisfies (weak) Wolfe's conditions (similar results also hold for strong Wolfe and Goldstein conditions) and has a search direction that makes an angle with the steepest descent direction that is bounded away from 90°, the algorithm is globally convergent.

Zoutendijk's theorem states that, given an iteration where is the descent direction and is the step length that satisfies (weak) Wolfe conditions, if the objective is bounded below on and is continuously differentiable in an open set containing the level set , where is the starting point of the iteration, and the gradient is Lipschitz continuous on , then

,

where is the angle between and the steepest descent direction .

The Zoutendijk condition above implies that

,

by the n-th term divergence test. Hence, if the algorithm chooses a search direction that is bounded away from 90° relative to the gradient, i.e., given ,

,

it follows that

.

However, the Zoutendijk condition doesn't guarantee convergence to a local minimum but only stationary points. Hence, additional conditions on the search direction is necessary, such as finding a direction of negative curvature whenever possible, to avoid converging to a nonminimizing stationary point.

Exact Search

Steepest Descent Method

Given the intuition that the negative gradient can be an effective search direction, steepest descent follows the idea and establishes a systematic method for minimizing the objective function. Setting as the direction, steepest descent computes the step-length by minimizing a single-variable objective function. More specifically, the steps of Steepest Descent Method are as follows:

Steepest Descent Algorithm

Set a starting point

Set a convergence criterium

Set

Set the maximum iteration

While :

If :
Break
End if

End while

Return ,

Figure 1. Steepest descent does not produce convergence on the Rosenbrock function. [4]

One advantage of the steepest descent method is the convergency. For a steepest descent method, it converges to a local minimum from any starting point. [4]

Theorem: Global Convergence of Steepest Descent

Let the gradient of be uniformly Lipschitz continuous on . Then, for the iterates with steepest-descent search directions, one of the following situations occurs:

  • for some finite

However, steepest descent has disadvantages in that the convergence is always slow and numerically may be not convergent (see Figure 1.)

Steepest descent method is a special case of gradient descent in that the step-length is analytically defined. However, step-lengths cannot always be computed analytically; in this case, inexact methods can be used to optimize at each iteration.

Inexact Search

While minimizing an objective function using numeric methods, in each iteration, the updated objective is , which is a function of after fixing the search direction. The goal is to minimize this objective with respect to . However, if one wants to solve for the exact minimum in each iteration, it could be computationally expensive and the algorithm will be time-consuming. Therefore, in practice, it is easier solve the subproblem

numerically and find a reasonable step length instead, which will decrease the objective function. That is, satisfies . However, convergence to the function's minimum cannot be guaranteed, so Wolfe or Goldstein conditions need to be applied when searching for an acceptable step length. [1]

Wolfe Conditions

This condition is proposed by Phillip Wolfe in 1969. [5] It provide an efficient way of choosing a step length that decreases the objective function sufficiently. It consists of two conditions: Armijo (sufficient decrease) condition and the curvature condition.

Figure 2. Armijo Condition. [1]

Armijo (Sufficient Decrease) Condition

,

where and is often chosen to be of a small order of magnitude around . This condition ensures the computed step length can reduces the objective function sufficiently. Only using this condition, however, it cannot be guaranteed that will converge in a reasonable number of iterations, since Armijo condition is always satisfied with step length being small enough. Therefore, the second condition below needs to be paired with the sufficient decrease condition to keep from being too short.

Figure 3. Curvature Condition. [1]

Curvature Condition

,

where is much greater than and is typically on the order of 0.1. This condition ensures a sufficient increase of the gradient.

This left hand side of the curvature condition is simply the derivative of , thus ensuring to be in the vicinity of a stationary point of .

Figure 4. Goldstein Condition. [1]

Strong Wolfe Curvature Condition

The (weak) Wolfe conditions can result in an value that is not close to the minimizer of . The (weak) Wolfe conditions can be modified by using the following condition called Strong Wolfe condition, which writes the curvature condition in absolute values

.

The strong Wolfe curvature condition restricts the slope of from getting too positive, hence excluding points far away from the stationary point of .

Goldstein Conditions

Another condition to find an appropriate step length is called Goldstein conditions:

where . The Goldstein condition is quite similar with the Wolfe condition in that, its second inequality ensures that the step length will decrease the objective function sufficiently and its first inequality keep from being too short. In comparison with Wolfe condition, one disadvantage of Goldstein condition is that the first inequality of the condition might exclude all minimizers of function. However, usually it is not a fatal problem as long as the objective decreases in the direction of convergence. As a short conclusion, the Goldstein and Wolfe conditions have quite similar convergence theories. Compared to the Wolfe conditions, the Goldstein conditions are often used in Newton-type methods but are not well suited for quasi-Newton methods that maintain a positive definite Hessian approximation.

Backtracking Line Search

The backtracking method is often used to find the appropriate step length and terminate line search based. The backtracking method starts with a relatively large initial step length (e.g., 1 for Newton method), then iteratively shrinking it by a contraction factor until the Armijo (sufficient decrease) condition is satisfied. The advantage of this approach is that the curvature condition needs not be considered, and the step length found at each line search iterate is ensured to be short enough to satisfy sufficient decrease but large enough to still allow the algorithm to make reasonable progress towards convergence.

The backtracking algorithm involves control parameters and , and it is roughly as follows:

Choose
Set
While
End while
Return

Numeric Example

As an example, we can use line search method to solve the following unconstrained optimization problem:

.


First iteration:

Starting from , we have

Taking partial derivative with respect to and set it to zero Therefore,


Second iteration:

Given , we have Then

Taking partial derivative with respect to and set it to zero

Then we can get

Therefore,


Third iteration:

Given , we have have

Then

Taking partial derivative with respect to

Then we can get

Therefore,


Fourth iteration:

Given , we have

Then

Taking partial derivative with respect to ,

Then we can get

Therefore,


Fifth iteration:

Given , we have

Then

Taking partial derivative with respect to ,

Then we can get

Therefore, and

Check to see if the convergence satisfied evaluated :

.


Since 0.0565 is relatively small and is close enough to zero, the line search is converged. The derived optimal solution is and the optimal value is -1.25.

Applications

A common application of line search method is in minimizing the loss function in training machine learning models. For example, when training a classification model with logistic regression, gradient descent algorithm (GD), which is a classic method of line search, can be used to minimize the logistic loss and compute the coefficients by iteration till the loss function reaches converges to a local minimum. [6] An alternative of gradient descent in machine learning domain is stochastic gradient descent (SGD). The difference is on computation expense that instead of using all training set to compute the descent, SGD simply sample one data point to compute the descent. The application of SGD reduces the computation costs greatly in large dataset compared to gradient descent.

Line search can also be applied in adaptive filtering that choose the convergence parameter so that the updated filter vector minimizes the sum of squared errors on a linear manifold. [7]

Conclusion

Line Search is a useful strategy to solve unconstrained optimization problems. The success of the line search algorithm depends on careful consideration of the choice of both the direction and the step size .This page has introduced the basic algorithm firstly, and then includes the exact search and inexact search. The exact search contains the steepest descent, and the inexact search covers the Wolfe and Goldstein conditions, backtracking, and Zoutendijk's theorem. More approaches to solve unconstrained optimization problems can be found in trust-region methods, conjugate gradient methods, Newton's method and Quasi-Newton method.

Reference

  1. 1.0 1.1 1.2 1.3 1.4 1.5 J. Nocedal and S. Wright, Numerical Optimization, Springer Science, 1999, p. 30-44.
  2. "Rosenbrock Function," Cornell University.
  3. R. Fletcher and M. Powell, "A Rapidly Convergent Descent Method for Minimization," The Computer Journal, vol. 6, no. 2, pp. 163–168, 1963.
  4. 4.0 4.1 R. Hauser, "Line Search Methods for Unconstrained Optimization," Oxford University Computing Laboratory, 2007.
  5. P. Wolfe, "Convergence Conditions for Ascent Methods," SIAM Review, vol. 11, no. 2, pp. 226–235, 1969.
  6. A. A. Tokuç, "Gradient Descent Equation in Logistic Regression."
  7. C. E. Davila, "Line Search Algorithms for Adaptive Filtering," IEEE Transactions on Signal Processing, vol. 41, no. 7, pp. 2490-2494, 1993.