Line search methods

Authors: Lihe Cao, Zhengyi Sui, Jiaqi Zhang, Yuqing Yan, and Yuhui Gu (6800 Fall 2021).

Introduction

When solving unconstrained optimization problems, the user needs to supply a starting point for the algorithm. Starting from this initial point $$ x_0 $$ , optimization algorithms generate a sequence of iterates $\{x_k\}_{k=0}^{\infty}$ that terminates when an approximate solution has been reached or no more progress can be made. Line search is one of the two fundamental strategies (the other being trust-region methods) for locating the new iterate $x_{k+1}$ given the current point.

Generic Line Search Method

Basic Algorithm

Pick an initial iterate point $$ x_0 $$
Repeat the following steps until the iterates $ x_k $ converge:
- Choose a descent direction $$ p_k $$ at $$ x_k $$ , that is, a direction with $\nabla f_{k}^\top p_{k}<0$ whenever $\nabla f_k \neq 0$
- Calculate a suitable step length $\alpha_k>0$ so that $f(x_k+\alpha_kp_k)<f_k$
- Set $x_{k+1}=x_k+\alpha_k p_k$
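
The loop above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `line_search_minimize`, the callables `direction` and `step_length`, the gradient-norm stopping test, and the fixed step used in the example are all assumptions made here, not part of the method's definition.

```python
from typing import Callable, List

Vector = List[float]


def line_search_minimize(
    grad: Callable[[Vector], Vector],
    direction: Callable[[Vector, Vector], Vector],
    step_length: Callable[[Vector, Vector, Vector], float],
    x0: Vector,
    tol: float = 1e-8,
    max_iter: int = 1000,
) -> Vector:
    """Generic line search loop: x_{k+1} = x_k + alpha_k * p_k."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        # Stop once the gradient norm is small (approximate stationarity).
        if sum(gi * gi for gi in g) ** 0.5 <= tol:
            break
        p = direction(x, g)           # should satisfy g^T p < 0
        alpha = step_length(x, g, p)  # should give f(x + alpha p) < f(x)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x


# Illustrative use on f(x) = x_1^2 + x_2^2 with the steepest descent
# direction and a fixed step of 0.25 (a choice that happens to decrease f
# for this particular function; it is not a general recommendation).
x_min = line_search_minimize(
    grad=lambda x: [2.0 * xi for xi in x],
    direction=lambda x, g: [-gi for gi in g],
    step_length=lambda x, g, p: 0.25,
    x0=[3.0, -4.0],
)
```

Separating the direction rule from the step length rule mirrors the two choices made in each iteration, so either one can be swapped without touching the loop.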

Search Direction for Line Search

The search direction should be chosen so that $$ f $$ decreases when moving from $$ x_k $$ to $x_{k+1}$ . The most obvious choice is $- \nabla f_k$ , the direction along which $$ f $$ decreases most rapidly. We can verify this claim using Taylor's theorem:

$f(x_k+\alpha p)=f(x_k)+\alpha p^\top\nabla f_k+\frac{1}{2}\alpha^2p^\top\nabla^2 f(x_k+tp)p$ for some $t\in (0,\alpha)$

The rate of change in $$ f $$ along the direction $$ p $$ at $$ x_k $$ is the coefficient of $\alpha$ . Therefore, the unit direction $$ p $$ of most rapid decrease is the solution to

$\min\ p^\top\nabla f_k$

$\text{s.t.}\ \ ||p||=1$ .

$p=\frac{-\nabla f_k}{||\nabla f_k||}$ is the solution and this direction is orthogonal to the contours of the function. In the following sections, we will use this as the default direction of the line search.
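
One way to check this solution, sketched here with the Cauchy-Schwarz inequality: for any $$ p $$ with $||p||=1$ ,

$p^\top\nabla f_k\geq -||p||\,||\nabla f_k||=-||\nabla f_k||$ ,

and the lower bound is attained exactly when $$ p $$ points along $-\nabla f_k$ , that is, $p=-\nabla f_k/||\nabla f_k||$ (assuming $\nabla f_k\neq 0$ ).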

Step Length

The step length is a positive value $\alpha_k$ such that $f(x_k+\alpha_k p_k)<f_k$ . When choosing $\alpha_k$ , we need to trade off a substantial reduction of $$ f $$ against the time spent computing the step. If $\alpha_k$ is too large, the step will overshoot; if it is too small, the iterates make slow progress and many iterations are needed to converge. Both exact and inexact line searches can be used to find $\alpha_k$ ; these approaches are detailed in the following sections.

Convergence

For a line search algorithm to be reliable, it should be globally convergent, that is, the gradient norms $||\nabla f(x_{k})||$ should converge to zero as the iterations proceed, i.e., $\lim_{k\to\infty} ||\nabla f(x_{k})|| = 0$ .

It can be shown from Zoutendijk's theorem ^[1] that if the line search algorithm satisfies (weak) Wolfe's conditions (similar results also hold for strong Wolfe and Goldstein conditions) and has a search direction that makes an angle with the steepest descent direction that is bounded away from 90°, the algorithm is globally convergent.

Zoutendijk's theorem states that, given an iteration where $$ p_k $$ is the descent direction and $\alpha_k$ is the step length that satisfies (weak) Wolfe conditions, if the objective $$ f $$ is bounded below on $\mathbb{R}^{n}$ and is continuously differentiable in an open set $\mathcal{N}$ containing the level set $\mathcal{L}:=\{x\ |\ f(x)\leq f(x_0)\}$ , where $$ x_0 $$ is the starting point of the iteration, and the gradient $\nabla f$ is Lipschitz continuous on $\mathcal{N}$ , then

$\sum_{k=0}^{\infty}\cos^{2}\theta_{k}||\nabla f_{k}||^2 < \infty$ ,

where $\theta_{k}$ is the angle between $$ p_k $$ and the steepest descent direction $-\nabla f(x_{k})$ .

The Zoutendijk condition above implies that

$\lim_{k\to\infty}\cos^{2}\theta_{k}||\nabla f_{k}||^2=0$ ,

since the terms of a convergent series must tend to zero. Hence, if the algorithm chooses search directions whose angle with the steepest descent direction is bounded away from $90^\circ$ , i.e., there exists $\epsilon>0$ such that

$\cos\theta_{k}\geq\epsilon>0,\ \forall k$ ,

it follows that

$\lim_{k\to\infty}||\nabla f_{k}||=0$ .
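
The last implication is a short squeeze argument, spelled out here for completeness using the bound $\cos\theta_k\geq\epsilon$ :

$0\leq\epsilon^{2}||\nabla f_{k}||^2\leq\cos^{2}\theta_{k}||\nabla f_{k}||^2\to 0\quad\text{as } k\to\infty$ ,

so $||\nabla f_{k}||^2\to 0$ and therefore $||\nabla f_{k}||\to 0$ .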

However, the Zoutendijk condition does not guarantee convergence to a local minimum, only to a stationary point. Hence, additional conditions on the search direction are necessary, such as finding a direction of negative curvature, to prevent the iteration from converging to a nonminimizing stationary point.

Exact Search

Steepest Descent Method

Given the intuition that the negative gradient $- \nabla f_k$ can be an effective search direction, steepest descent follows the idea and establishes a systematic method for minimizing the objective function. Setting $- \nabla f_k$ as the direction, steepest descent computes the step-length $\alpha_k$ by minimizing a single-variable objective function. More specifically, the steps of Steepest Descent Method are as follows.

Steepest Descent Algorithm

Set a starting point $$ x_0 $$
Set a convergence criterion $\epsilon>0$
Set $$ k = 0 $$
Set the maximum iteration $$ N $$
While $k \le N$ :

$\nabla f(x_k) = \left.\frac{\partial f(x)}{\partial x}\right\vert_{x=x_k}$

If $||\nabla f(x_k)||\le \epsilon$ :

Break

End if

$\alpha_k=\underset{\alpha}{\arg\min} f(x_k-\alpha \nabla f(x_k))$
$x_{k+1}=x_k-\alpha_k \nabla f(x_k)$
$$ k = k + 1 $$
End while

Return $x_{k}$ , $f(x_{k})$
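
The algorithm above can be sketched in Python as follows. The exact minimization over $\alpha$ is approximated here by a golden-section search on a bracket $[0, \alpha_{\max}]$ ; that substitution, the assumption that $\phi$ is unimodal on the bracket, and the names `golden_section`, `steepest_descent` and `alpha_max` are illustrative choices made for this sketch, not part of the algorithm as stated.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def golden_section(phi: Callable[[float], float], lo: float, hi: float,
                   tol: float = 1e-10) -> float:
    """Approximate the minimizer of a unimodal phi on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c                 # minimizer lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                 # minimizer lies in [c, b]
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


def steepest_descent(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    x0: Vector,
    eps: float = 1e-6,
    max_iter: int = 1000,
    alpha_max: float = 10.0,  # assumed bracket for the step length
) -> Tuple[Vector, float]:
    x = list(x0)
    for _ in range(max_iter + 1):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        # phi(alpha) = f(x_k - alpha * grad f(x_k)), minimized over alpha
        phi = lambda a: f([xi - a * gi for xi, gi in zip(x, g)])
        alpha = golden_section(phi, 0.0, alpha_max)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, f(x)


# Illustrative quadratic with minimizer (1, -3).
x_star, f_star = steepest_descent(
    f=lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2,
    grad=lambda x: [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)],
    x0=[0.0, 0.0],
)
```

For objectives where $\phi$ is not unimodal on the bracket, the golden-section search may return a local rather than global minimizer of $\phi$ , which is one practical reason inexact line searches are common.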

One advantage of the steepest descent method is that it has a nice convergence theory: under mild assumptions, it reaches or approaches a stationary point from any starting point, unless the objective is unbounded below.

Theorem: Global Convergence of Steepest Descent ^[2]

Let the gradient of $f \in C^1$ be uniformly Lipschitz continuous on $\mathbb{R}^{n}$ . Then, for the iterates with steepest-descent search directions, one of the following situations occurs:

$\nabla f(x_k) = 0$ for some finite $$ k $$

$\lim_{k \to \infty} f(x_k) = -\infty$

$\lim_{k \to \infty} \nabla f(x_k) = 0$

The steepest descent method described here is a special case of gradient descent in which the step length is defined by exact minimization along the search direction. Other choices of $\alpha$ lead to generalizations of the method.

Inexact Search

When we minimize the objective function numerically, each iteration works with $\phi(\alpha) = f(x_k+\alpha p_k)$ , a function of $\alpha$ alone once the direction is fixed. Our goal is to minimize $\phi$ with respect to $\alpha$ . However, solving for the exact minimizer in each iteration can be computationally expensive and make the algorithm slow. Therefore, in practice we solve the subproblem

$\underset{\alpha}{\min} \quad \phi(\alpha) = f(x_k + \alpha p_k)$

approximately, looking for a reasonable step length $\alpha$ that decreases the objective function, that is, $f(x_k + \alpha p_k) \leq f(x_k)$ . A simple decrease alone, however, does not guarantee convergence to a minimizer, so we often impose the following conditions to find an acceptable step length.

Wolfe Conditions

The Wolfe conditions were proposed by Philip Wolfe in 1969. They provide an efficient way of choosing a step length that decreases the objective function sufficiently. They consist of two conditions: the Armijo (sufficient decrease) condition and the curvature condition.

(1) Armijo (sufficient decrease) condition

$f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_{k} p^\top_k \nabla{f(x_k)}$ ,

where $c_1\in(0,1)$ is often chosen to be small, around $10^{-4}$ . This condition ensures that the computed step length reduces the objective function sufficiently. Using this condition alone, however, we cannot guarantee that $$ x_k $$ converges in a reasonable number of iterations, since the Armijo condition is always satisfied by a sufficiently small step length. Therefore, we pair it with the second condition below to keep $\alpha_k$ from being too short.

(2) Curvature condition

$\nabla{f(x_k + \alpha_k p_k)}^\top p_k \geq c_2 \nabla{f(x_k)}^\top p_k$ ,

where $c_2\in(c_1,1)$ is much greater than $$ c_1 $$ ; typical values are 0.9 for Newton or quasi-Newton search directions and 0.1 for nonlinear conjugate gradient directions. This condition ensures that the slope of $\phi$ at $\alpha_k$ has increased sufficiently relative to the initial slope $\phi'(0)$ .

The left-hand side of the curvature condition is simply the derivative $\phi'(\alpha_k)$ , so the condition excludes steps at which $\phi$ still decreases steeply, where a longer step would reduce $$ f $$ further.

(2*) Strong Wolfe curvature condition

The (weak) Wolfe conditions can result in an $\alpha$ value that is not close to a minimizer of $\phi(\alpha)$ . The strong Wolfe conditions address this by replacing the curvature condition $$ (2) $$ with its absolute-value form

$|p^\top_k \nabla{f(x_k + \alpha_k p_k)}| \leq c_2 |p^\top_k \nabla{f(x_k)}|$ .

The strong Wolfe curvature condition restricts the slope of $\phi(\alpha)$ from getting too positive, hence excluding points far away from the stationary point of $\phi$ .
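
As a hedged illustration, the three tests can be written as a small checker. The function name `wolfe_conditions`, the helper `dot`, and the default values `c1=1e-4` and `c2=0.9` are assumptions chosen for this sketch; the one-dimensional objective $f(x)=x^2$ is only a toy example.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def wolfe_conditions(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    x: Vector,
    p: Vector,
    alpha: float,
    c1: float = 1e-4,  # illustrative default
    c2: float = 0.9,   # illustrative default
) -> Tuple[bool, bool, bool]:
    """Return (armijo, curvature, strong_curvature) for the step alpha."""
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    slope0 = dot(grad(x), p)           # phi'(0)
    slope_alpha = dot(grad(x_new), p)  # phi'(alpha)
    armijo = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = slope_alpha >= c2 * slope0
    strong_curvature = abs(slope_alpha) <= c2 * abs(slope0)
    return armijo, curvature, strong_curvature


# Toy problem: f(x) = x^2 at x = 1 with p = -grad f(1) = -2,
# so phi(alpha) = (1 - 2 alpha)^2 with minimizer alpha = 0.5.
f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
at_minimizer = wolfe_conditions(f, grad, [1.0], [-2.0], 0.5)
too_short = wolfe_conditions(f, grad, [1.0], [-2.0], 0.01)
weak_only = wolfe_conditions(f, grad, [1.0], [-2.0], 0.99)
```

In this toy problem the step 0.01 passes the Armijo test but fails the curvature test, while the step 0.99 satisfies both weak Wolfe conditions yet fails the strong curvature test because the slope there is large and positive.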

Goldstein Conditions

Another way to find an appropriate step length is given by the Goldstein conditions:

$f(x_k) + (1-c) \alpha_k \nabla{f^\top_k} p_k \leq f(x_k + \alpha_k p_k) \leq f(x_k) + c \alpha_k \nabla{f^\top_k} p_k$

where $0 < c < 1/2$ . The Goldstein conditions are similar to the Wolfe conditions in that the second inequality ensures that the step length $\alpha_k$ decreases the objective function sufficiently, while the first inequality keeps $\alpha_k$ from being too short. Compared with the Wolfe conditions, one disadvantage of the Goldstein conditions is that the first inequality might exclude all minimizers of $\phi$ . However, this is usually not a fatal problem as long as the objective decreases along the search direction. In short, the Goldstein and Wolfe conditions have quite similar convergence theories. The Goldstein conditions are often used in Newton-type methods but are not well suited for quasi-Newton methods that maintain a positive definite Hessian approximation.
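
A minimal sketch of the two-sided test follows. The function name `goldstein_conditions`, the default `c=0.25`, and the toy objective $f(x)=x^2$ are assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def goldstein_conditions(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    x: Vector,
    p: Vector,
    alpha: float,
    c: float = 0.25,  # illustrative value in (0, 1/2)
) -> bool:
    """Check both Goldstein inequalities for the step alpha."""
    slope0 = sum(gi * pi for gi, pi in zip(grad(x), p))  # grad f_k^T p_k
    f0 = f(x)
    f_new = f([xi + alpha * pi for xi, pi in zip(x, p)])
    lower = f0 + (1.0 - c) * alpha * slope0  # rules out steps that are too short
    upper = f0 + c * alpha * slope0          # sufficient decrease
    return lower <= f_new <= upper


# Toy problem: f(x) = x^2 at x = 1 with p = -2, so phi(alpha) = (1 - 2 alpha)^2.
f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
accepted = goldstein_conditions(f, grad, [1.0], [-2.0], 0.5)
too_short = goldstein_conditions(f, grad, [1.0], [-2.0], 0.1)
too_long = goldstein_conditions(f, grad, [1.0], [-2.0], 0.9)
```

For this toy problem with $c=0.25$ , the two inequalities reduce to $0.25\leq\alpha\leq 0.75$ , an interval that happens to contain the minimizer $\alpha=0.5$ of $\phi$ .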

Backtracking Line Search

The backtracking method is often used to find an appropriate step length and to terminate the line search. It starts with a relatively large initial step length (e.g., 1 for Newton's method), then iteratively shrinks it by a contraction factor until the Armijo (sufficient decrease) condition is satisfied. The advantage of this approach is that the curvature condition need not be considered: the step length found at each line search is short enough to satisfy sufficient decrease but still large enough for the algorithm to make reasonable progress towards convergence.

The backtracking algorithm involves control parameters $\rho\in(0,1)$ and $c\in(0,1)$ , and it is roughly as follows:

Choose $\alpha_0>0, \rho\in(0,1), c\in(0,1)$
Set $\alpha\leftarrow\alpha_0$
While $f(x_{k}+\alpha p_{k}) > f(x_{k})+c\alpha\nabla f_{k}^{\top}p_{k}$ :

$\alpha\leftarrow\rho\alpha$

End while

Return $\alpha_k=\alpha$
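
The backtracking procedure translates almost line by line into Python. In this sketch the function name `backtracking`, the default parameter values, and the `max_shrinks` safeguard are assumptions added for illustration; the safeguard is not part of the pseudocode.

```python
from typing import Callable, List

Vector = List[float]


def backtracking(
    f: Callable[[Vector], float],
    grad_x: Vector,
    x: Vector,
    p: Vector,
    alpha0: float = 1.0,
    rho: float = 0.5,
    c: float = 1e-4,
    max_shrinks: int = 60,  # safeguard against an endless loop (assumption)
) -> float:
    """Shrink alpha by the factor rho until the Armijo condition holds."""
    fx = f(x)
    slope = sum(gi * pi for gi, pi in zip(grad_x, p))  # grad f_k^T p_k
    alpha = alpha0
    for _ in range(max_shrinks):
        x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
        if f(x_new) <= fx + c * alpha * slope:
            break  # sufficient decrease achieved
        alpha *= rho
    return alpha


# Illustrative problem: f(x) = x_1^2 + 10 x_2^2 at x = (1, 1),
# searching along the steepest descent direction p = -grad f(x).
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
x_k = [1.0, 1.0]
g_k = [2.0 * x_k[0], 20.0 * x_k[1]]
p_k = [-gi for gi in g_k]
alpha_k = backtracking(f, g_k, x_k, p_k)
```

Starting from $\alpha_0=1$ with $\rho=0.5$ , this example rejects the steps 1, 0.5, 0.25 and 0.125 and accepts $\alpha_k=0.0625$ after four contractions.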

Numeric Example

For example, we can use steepest descent with exact line search to solve the unconstrained optimization problem

$\min f(x)=x_1-x_2+2x_1x_2+2x_1^2+x_2^2$

First iteration: Starting from $$ x_0=[0,0]^T $$ , we have $\nabla f(x)=[1+2x_2+4x_1,-1+2x_1+2x_2]$

$p_0=-\nabla f(x_0)=[-1,1]$

$x_1=[0,0]+\alpha_0 [-1,1] =[-\alpha_0, \alpha_0]$

$f(x_1)=\alpha_0^2-2\alpha_0$

Taking the derivative with respect to $\alpha_0$ and setting it to zero,

$\frac{\partial f(x_1)}{\partial \alpha_0}=2\alpha_0-2=0$ Then we can get $\alpha_0=1$

Therefore, $$ x_1=[-1,1] $$

Second iteration: Given $$ x_1=[-1,1] $$ , we have $p_1=-\nabla f(x_1)=[1,1]$

Then $x_2=[-1,1]+\alpha_1[1,1]=[-1+\alpha_1,1+\alpha_1]$

$f(x_2)=5\alpha_1^2-2\alpha_1-1$

Taking the derivative with respect to $\alpha_1$ and setting it to zero,

$\frac{\partial f(x_2)}{\partial \alpha_1}=10\alpha_1-2=0$ Then we can get $\alpha_1=0.2$

Therefore, $$ x_2=[-0.8,1.2] $$

Third iteration: Given $$ x_2=[-0.8,1.2] $$ , we have $p_2=-\nabla f(x_2)=[-0.2,0.2]$

Then $x_3=[-0.8,1.2]+\alpha_2[-0.2,0.2]=[-0.8-0.2\alpha_2,1.2+0.2\alpha_2]$

$f(x_3)=0.04\alpha_2^2-0.08\alpha_2-1.2$

Taking the derivative with respect to $\alpha_2$ and setting it to zero,

$\frac{\partial f(x_3)}{\partial \alpha_2}=0.08\alpha_2-0.08=0$ Then we can get $\alpha_2=1$

Therefore, $$ x_3=[-1,1.4] $$

Fourth iteration: Given $$ x_3=[-1,1.4] $$ , we have $p_3=-\nabla f(x_3)=[0.2,0.2]$

Then $x_4=[-1,1.4]+\alpha_3[0.2,0.2]=[0.2\alpha_3-1,0.2\alpha_3+1.4]$

$f(x_4)=0.2\alpha_3^2-0.08\alpha_3-1.24$

Taking the derivative with respect to $\alpha_3$ and setting it to zero,

$\frac{\partial f(x_4)}{\partial \alpha_3}=0.4\alpha_3-0.08=0$ Then we can get $\alpha_3=0.2$

Therefore, $$ x_4=[-0.96,1.44] $$

Fifth iteration: Given $$ x_4=[-0.96,1.44] $$ , we have $p_4=-\nabla f(x_4)=[-0.04,0.04]$

Then $x_5=[-0.96,1.44]+\alpha_4[-0.04,0.04]=[-0.96-0.04\alpha_4,0.04\alpha_4+1.44]$

$f(x_5)= 0.0016 \alpha_4^2-0.0032 \alpha_4-1.248$

Taking the derivative with respect to $\alpha_4$ and setting it to zero,

$\frac{\partial f(x_5)}{\partial \alpha_4}=0.0032\alpha_4-0.0032=0$ Then we can get $\alpha_4=1$

Therefore, $$ x_5=[-1,1.48] $$ and $\nabla f(x_5)=[-0.04,-0.04]$

Check whether the convergence criterion is satisfied by evaluating $||\nabla f(x_5)||$ :

$|| \nabla f(x_5)||=\sqrt{(-0.04)^2+(-0.04)^2}\approx 0.0566$ . Since 0.0566 is relatively small and close enough to zero, we accept $$ x_5 $$ as an approximate minimizer and stop.
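
The hand computation can be cross-checked with a short script. This sketch uses the objective consistent with the gradient stated in the example, $f(x)=x_1-x_2+2x_1x_2+2x_1^2+x_2^2$ , and exploits the fact that for a quadratic with constant Hessian $$ H $$ the exact step along $-g$ is $\alpha=g^\top g/(g^\top Hg)$ ; the names `grad`, `exact_step`, `H` and `iterates` are illustrative.

```python
import math
from typing import List

Vector = List[float]

# Constant Hessian of f(x) = x1 - x2 + 2*x1*x2 + 2*x1^2 + x2^2.
H = [[4.0, 2.0], [2.0, 2.0]]


def grad(x: Vector) -> Vector:
    return [1.0 + 4.0 * x[0] + 2.0 * x[1], -1.0 + 2.0 * x[0] + 2.0 * x[1]]


def exact_step(g: Vector) -> float:
    """Exact minimizer of phi(alpha) = f(x - alpha * g) for this quadratic."""
    Hg = [H[0][0] * g[0] + H[0][1] * g[1], H[1][0] * g[0] + H[1][1] * g[1]]
    return (g[0] * g[0] + g[1] * g[1]) / (g[0] * Hg[0] + g[1] * Hg[1])


x = [0.0, 0.0]
iterates = [x]
steps = []
for k in range(5):
    g = grad(x)
    alpha = exact_step(g)
    x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    steps.append(alpha)
    iterates.append(x)
    print(f"k={k}: alpha={alpha:.4f}, x_next=[{x[0]:.4f}, {x[1]:.4f}]")

g5 = grad(iterates[5])
grad_norm = math.sqrt(g5[0] ** 2 + g5[1] ** 2)
print(f"||grad f(x_5)|| = {grad_norm:.4f}")
```

The step lengths alternate between 1 and 0.2 and the fifth iterate agrees with $$ x_5=[-1,1.48] $$ . Setting the gradient to zero shows that the exact minimizer of this quadratic is $[-1,1.5]$ , so $$ x_5 $$ is within 0.02 of it.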

Applications

Reference

[1] Nocedal, J. & Wright, S. (2006) Numerical Optimization (Springer-Verlag New York, New York) pp. 38-39.
[2] Dr Raphael Hauser, Oxford University Computing Laboratory, Line Search Methods for Unconstrained Optimization.
