# Difference between revisions of "Quasi-Newton methods"

Author: Jianmin Su (ChemE 6800 Fall 2020)

Steward: Allen Yang, Fengqi You

Quasi-Newton Methods are a kind of methods used to solve nonlinear optimization problems. They are based on Newton's method yet can be an alternative to Newton's method when the objective function is not twice-differentiable, which means the Hessian matrix is unavailable, or it is too expensive to calculate the Hessian matrix and its inverse.

## Introduction

The first quasi-Newton algorithm was developed by W.C. Davidon in the mid1950s and it turned out to be a milestone in nonlinear optimization problems. He was trying to solve a long optimization calculation but he failed to get the result with the original method due to the low performances of computers at that time, thus he managed to build the quasi-Newton method to solve it. Later then, Fletcher and Powell proved that the new algorithm was more efficient and more reliable than the other existing methods.

During the following years, numerous variants were proposed, include Broyden's method (1965), the SR1 formula (Davidon 1959, Broyden 1967), the DFP method (Davidon, 1959; Fletcher and Powell, 1963), and the BFGS method (Broyden, 1969; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970).

In optimization problems, Newton's method uses first and second derivatives, gradient and the Hessian in multivariate scenarios, to find the optimal point, it is applied to a twice-differentiable function ${\displaystyle f}$ to find the roots of the first derivative (solutions to ${\displaystyle f'(x)=0}$), also known as the stationary points of ${\displaystyle f}$.

The iteration of Newton's method is usually written as: ${\displaystyle x_{k+1}=x_{k}-H^{-1}\cdot \bigtriangledown f(x_{k})}$, where ${\displaystyle k}$ is the iteration number, ${\displaystyle H}$ is the Hessian matrix and ${\displaystyle H=[\bigtriangledown ^{2}f(x_{k})]}$

Iteraton would stop when it satisfies the convergence criteria like ${\displaystyle {df \over dx}=0,||\bigtriangledown f(x)||<\epsilon {\text{ or }}|f(x_{k+1})-f(x_{k})|<\epsilon }$

Though we can solve an optimization problem quickly with Newton's method, it has two obvious disadvantages:

1. The objective function must be twice-differentiable and the Hessian matrix must be positive definite.
2. The calculation is costly because it requires to compute the Jacobian matrix, Hessian matrix, and its inverse, which is time-consuming when dealing with a large-scale optimization problem.

However, we can use Quasi-Newton methods to avoid these two disadvantages.·

Quasi-Newton methods are similar to Newton's method but with one key idea that is different, they don't calculate the Hessian matrix, they introduce a matrix ${\displaystyle B}$ to estimate the Hessian matrix instead so that they can avoid the time-consuming calculations of Hessian matrix and its inverse. And there are many variants of quasi-Newton methods that simply depend on the exact methods they use in the estimation of the Hessian matrix.

## Theory and Algorithm

To illustrate the basic idea behind quasi-Newton methods, we start with building a quadratic model of the objective function at the current iterate ${\displaystyle x_{k}}$:

${\displaystyle m_{k}(p)=f_{k}+\bigtriangledown f_{k}^{T}p+{\frac {1}{2}}p^{T}B_{k}p}$ (1.1),

where ${\displaystyle B_{k}}$ is an ${\displaystyle n\times n}$ symmetric positive definite matrix that will be updated at every iteration.

The minimizer of this convex quadratic model is:

${\displaystyle p_{k}=-B_{k}^{-1}\bigtriangledown f_{k}}$ (1.2),

which is also used as the search direction.

Then the new iterate could be written as: ${\displaystyle x_{k+1}=x_{k}+\alpha _{k}p_{k}}$ (1.3),

where ${\displaystyle \alpha _{k}}$ is the step length that should satisfy the Wolfe conditions. The iteration is similar to Newton's method, but we use the approximate Hessian ${\displaystyle B_{k}}$ instead of the true Hessian.

To maintain the curve information we got from the previous iteration in ${\displaystyle B_{k+1}}$, we generate a new iterate ${\displaystyle x_{k+1}}$ and new quadratic modelto in the form of:

${\displaystyle m_{k+1}(p)=f_{k+1}+\bigtriangledown f_{k+1}^{T}p+{\frac {1}{2}}p^{T}B_{k+1}p}$ (1.4).

To construct the relationship between 1.1 and 1.4, we require that in 1.1 at ${\displaystyle p=0}$ the function value and gradient match ${\displaystyle f_{k}}$ and ${\displaystyle \bigtriangledown f_{k}}$, and the gradient of ${\displaystyle m_{k+1}}$should match the gradient of the objective function at the latest two iterates ${\displaystyle x_{k}}$and ${\displaystyle x_{k+1}}$, then we can get:

${\displaystyle \bigtriangledown m_{k+1}(-\alpha _{k}p_{k})=\bigtriangledown f_{k+1}-\alpha _{k}B_{k+1}p_{k}=\bigtriangledown f_{k}}$ (1.5)

and with some arrangements:

${\displaystyle B_{k+1}\alpha _{k}p_{k}=\bigtriangledown f_{k+1}-\bigtriangledown f_{k}}$ (1.6)

Define:

${\displaystyle s_{k}=x_{k+1}-x_{k}}$, ${\displaystyle y_{k}=\bigtriangledown f_{k+1}-\bigtriangledown f_{k}}$ (1.7)

So that (1.6) becomes: ${\displaystyle B_{k+1}s_{k}=y_{k}}$ (1.8), which is the secant equation.

To make sure ${\displaystyle B_{k+1}}$ is still a symmetric positive definite matrix, we need ${\displaystyle s_{k}^{T}s_{k}>0}$ (1.9).

To further preserve properties of ${\displaystyle B_{k+1}}$ and determine ${\displaystyle B_{k+1}}$ uniquely, we assume that among all symmetric matrices satisfying secant equation, ${\displaystyle B_{k+1}}$ is closest to the current matrix ${\displaystyle B_{k}}$, which leads to a minimization problem:

${\displaystyle B_{k+1}={\underset {B}{min}}||B-B_{k}||}$ (1.10) s.t. ${\displaystyle B=B^{T}}$, ${\displaystyle Bs_{k}=y_{k}}$,

where ${\displaystyle s_{k}}$ and ${\displaystyle y_{k}}$ satisfy (1.9) and ${\displaystyle B_{k}}$ is symmetric and positive definite.

Different matrix norms applied in (1.10) results in different quasi-Newton methods. The weighted Frobenius norm can help us get an easy solution to the minimization problem: ${\displaystyle ||A||_{W}=||W^{\frac {1}{2}}AW^{\frac {1}{2}}||_{F}}$ (1.11).

The weighted matrix ${\displaystyle W}$ can be any matrix that satisfies the relation ${\displaystyle Wy_{k}=s_{k}}$.

We skip procedures of solving the minimization problem (1.10) and here is the unique solution of (1.10):

${\displaystyle B_{k+1}=(I-\rho y_{k}s_{k}^{T})B_{k}(I-\rho s_{k}y_{k}^{T})+\rho y_{k}y_{k}^{T}}$ (1.12)

where ${\displaystyle \rho ={\frac {1}{y_{k}^{T}s_{k}}}}$ (1.13)

Finally, we get the updated ${\displaystyle B_{k+1}}$. However, according to (1.2) and (1.3), we also need the inverse of ${\displaystyle B_{k+1}}$ in next iterate.

To get the inverse of ${\displaystyle B_{k+1}}$, we can apply the Sherman-Morrison formula to avoid complicated calculation of inverse.

Set ${\displaystyle H_{k}=B_{k}^{-1}}$, with Sherman-Morrison formula we can get:

${\displaystyle H_{k+1}=H_{k}+{\frac {s_{k}s_{k}^{T}}{s_{k}^{T}y_{k}}}-{\frac {H_{k}y_{k}y_{k}^{T}H_{k}}{y_{k}^{T}H_{k}y_{k}}}}$ (1.14)

With the derivation above, we can now understand how do quasi-Newton methods get rid of calculating the Hessian matrix and its inverse. We can directly estimate the inverse of Hessian and we can use (1.14) to update the approximation of the inverse of Hessian, which leads to the DFP method, or we can directly estimate the Hessian matrix and this is the main idea in the BFGS method.

${\displaystyle B_{k}H_{k}}$

### DFP method

The DFP method, which is also known as the Davidon–Fletcher–Powell formula, is named after W.C. Davidon, Roger Fletcher, and Michael J.D. Powell. It was proposed by Davidon in 1959 first and then improved by Fletched and Powell. DFP method uses an ${\displaystyle n\times n}$ symmetric positive definite matrix ${\displaystyle B_{k}}$ to estimate the inverse of Hessian matrix and its algorithm is shown below.

#### DFP Algorithm

To avoid confusion, we use ${\displaystyle D}$ to represent the approximation of the inverse of the Hessian matrix.

1. Given the starting point ${\displaystyle x_{0}}$; convergence tolerance ${\displaystyle \epsilon ,\epsilon >0}$; the initial estimation of inverse Hessian matrix ${\displaystyle D_{0}=I}$; ${\displaystyle k=0}$.
2. Compute the search direction ${\displaystyle d_{k}=-D_{k}\cdot \bigtriangledown f_{k}}$.
3. Compute the step length ${\displaystyle \lambda _{k}}$ with a line search procedure that satisfies Wolfe conditions. And then set
${\displaystyle s_{k}={\lambda }_{k}d_{k}}$,
${\displaystyle x_{k+1}=x_{k}+s_{k}}$
4. If ${\displaystyle ||\bigtriangledown f_{k+1}||<\epsilon }$, then end of the iteration, otherwise continue step5.
5. Computing ${\displaystyle y_{k}=g_{k+1}-g_{k}}$.
6. Update the ${\displaystyle D_{k+1}}$ with
${\displaystyle D_{k+1}=D_{k}+{\frac {s_{k}s_{k}^{T}}{s_{k}^{T}y_{k}}}-{\frac {D_{k}y_{k}y_{k}^{T}D_{k}}{y_{k}^{T}D_{k}y_{k}}}}$
7. Update ${\displaystyle k}$ with ${\displaystyle k=k+1}$ and go back to step2.

### BFGS method

BFGS method is named for its four discoverers Broyden, Fletcher, Goldfarb, and Shanno, it is considered as the most effective quasi-Newton algorithm. Unlike the DFP method, the BFGS method uses an ${\displaystyle n\times n}$ symmetric positive definite matrix ${\displaystyle B_{k}}$ to estimate the Hessian matrix.

#### BFGS Algorithm

1. Given the starting point ${\displaystyle x_{0}}$; convergence tolerance ${\displaystyle \epsilon ,\epsilon >0}$; the initial estimation of Hessian matrix ${\displaystyle B_{0}=I}$; ${\displaystyle k=0}$.
2. Compute the search direction ${\displaystyle d_{k}=-B_{k}^{-1}\cdot \bigtriangledown f_{k}}$.
3. Compute the step length ${\displaystyle \lambda _{k}}$ with a line search procedure that satisfies Wolfe conditions. And then set
${\displaystyle s_{k}={\lambda }_{k}d_{k}}$,
${\displaystyle x_{k+1}=x_{k}+s_{k}}$
4. If ${\displaystyle ||\bigtriangledown f_{k+1}||<\epsilon }$, then end of the iteration, otherwise continue step5.
5. Computing ${\displaystyle y_{k}=\bigtriangledown f_{k+1}-\bigtriangledown f_{k}}$.
6. According to (1.12), (1.13) and (1.14), update the ${\displaystyle B_{k+1}^{-1}}$ with
${\displaystyle B_{k+1}^{-1}=(I-\rho s_{k}y_{k}^{T})B_{k}^{-1}(I-\rho y_{k}s_{k}^{T})+\rho s_{k}s_{k}^{T}}$ , ${\displaystyle \rho ={\frac {1}{y_{k}^{T}s_{k}}}}$
7. Update ${\displaystyle k}$ with ${\displaystyle k=k+1}$ and go back to step2.

## Numerical Example

The following is an example to show how to solve an unconstrained nonlinear optimization problem with the DFP method.

{\displaystyle {\text{min }}{\begin{aligned}f(x_{1},x_{2})&=x_{1}^{2}+{\frac {1}{2}}x_{2}^{2}+3\end{aligned}}}

Step 1:

${\displaystyle x_{0}=(1,2)^{T}}$

${\displaystyle B_{0}}$: ${\displaystyle {\begin{pmatrix}1&0\\0&1\end{pmatrix}}}$

${\displaystyle \bigtriangledown f_{x}}$: ${\displaystyle {\begin{pmatrix}2x_{1}\\x_{2}\end{pmatrix}}}$

${\displaystyle \epsilon =10^{-5}}$

${\displaystyle k=0}$

${\displaystyle \lambda =1}$

Step 2: ${\displaystyle d_{0}=-B_{0}^{-1}\bigtriangledown f_{0}}$${\displaystyle =-{\begin{pmatrix}1&0\\0&1\end{pmatrix}}}$${\displaystyle {\begin{pmatrix}2\\2\end{pmatrix}}}$ ${\displaystyle ={\begin{pmatrix}-2\\-2\end{pmatrix}}}$

Step 3: ${\displaystyle s_{0}=d_{0}}$

${\displaystyle x_{1}=x_{0}+s_{0}}$${\displaystyle ={\begin{pmatrix}1\\2\end{pmatrix}}}$${\displaystyle +{\begin{pmatrix}-2\\-2\end{pmatrix}}}$${\displaystyle ={\begin{pmatrix}-1\\0\end{pmatrix}}}$

Step 4: ${\displaystyle \bigtriangledown f_{0}}$${\displaystyle ={\begin{pmatrix}-2\\0\end{pmatrix}}}$

Step 5: ${\displaystyle y_{0}=\bigtriangledown f_{1}-\bigtriangledown f_{0}}$${\displaystyle ={\begin{pmatrix}-4\\-2\end{pmatrix}}}$

Step 6:

${\displaystyle B_{1}=B_{0}+{\frac {s_{0}s_{0}^{T}}{s_{0}^{T}y_{0}}}-{\frac {D_{0}y_{0}y_{0}^{T}D_{0}}{y_{0}^{T}D_{0}y_{0}}}}$${\displaystyle ={\begin{pmatrix}1&0\\0&1\end{pmatrix}}}$${\displaystyle +{\frac {1}{12}}{\begin{pmatrix}4&4\\4&4\end{pmatrix}}}$${\displaystyle -{\frac {1}{20}}{\begin{pmatrix}16&8\\8&4\end{pmatrix}}}$ ${\displaystyle ={\begin{pmatrix}0.53333&-0.0667\\-0.0667&1.1333\end{pmatrix}}}$

## Application

Quasi-Newton methods are used in various programming languages to solve optimization problems:

1. Mathematica includes quasi-Newton solvers.
2. In MATLAB's Optimization Toolbox, the fminunc function uses (among other methods) the BFGS quasi-Newton method.
3. R's optim general-purpose optimizer routine uses the BFGS method by using method="BFGS".<ref>
4. Scipy.optimize has fmin_bfgs. In the SciPy extension to Python, the scipy.optimize.minimize function includes, among other methods, a BFGS implementation.