Author: Alexandra Roberts, Anye Shi, Yue Sun (CHEME/SYSEN 6800, Fall 2021)

Introduction

Figure 1. A comparison of the convergence of gradient descent (in red) and conjugate vector (in green) for minimizing a quadratic function. In theory, the conjugate gradient method converges in at most ${\displaystyle n}$ steps, where ${\displaystyle n}$ is the size of the matrix of the system (here ${\displaystyle n=2}$).[1]

The conjugate gradient method (CG) was originally invented to minimize a quadratic function:

 ${\displaystyle F({\textbf {x}})={\frac {1}{2}}{\textbf {x}}^{T}{\textbf {A}}{\textbf {x}}-{\textbf {b}}^{T}{\textbf {x}}}$


where ${\displaystyle {\textbf {A}}}$ is an ${\displaystyle n\times n}$ symmetric positive definite matrix, and ${\displaystyle {\textbf {x}}}$ and ${\displaystyle {\textbf {b}}}$ are ${\displaystyle n\times 1}$ vectors. Minimizing ${\displaystyle F}$ is equivalent to solving the linear system ${\displaystyle {\textbf {A}}{\textbf {x}}={\textbf {b}}}$, since the minimizer satisfies ${\displaystyle \nabla F({\textbf {x}})={\textbf {A}}{\textbf {x}}-{\textbf {b}}={\textbf {0}}}$.

The conjugate gradient method is often implemented as an iterative algorithm and can be considered as lying between Newton's method, a second-order method that incorporates the Hessian and the gradient, and the method of steepest descent, a first-order method that uses only the gradient. Newton's method usually reduces the number of iterations needed, but the calculation of the Hessian matrix and its inverse increases the computation required per iteration. Steepest descent takes repeated steps in the opposite direction of the gradient of the function at the current point; it often takes steps in the same direction as earlier ones, resulting in slow convergence (Figure 1). The conjugate gradient method was developed to avoid the high computational cost of Newton's method and to accelerate the convergence rate of steepest descent.

The idea of the CG method is to first pick ${\displaystyle n}$ mutually conjugate search directions and then, in each search direction, take exactly one step, with the step size chosen to reach the component of the proposed solution ${\displaystyle {\textbf {x}}}$ along that direction. The solution is reached after at most ${\displaystyle n}$ steps [2]: in exact arithmetic, the number of iterations needed by the CG method equals the number of distinct eigenvalues of ${\displaystyle {\textbf {A}}}$, which is at most ${\displaystyle n}$. This makes it attractive for large and sparse problems. The method can be used to solve least-squares problems and can also be generalized to a minimization method for general smooth functions [2].

Theory

The definition of A-conjugate direction

Figure 2. The method of conjugate directions. Here, we use the coordinate axes as search directions.[3]

Let ${\displaystyle {\textbf {A}}}$ be a symmetric positive definite matrix. The nonzero search direction vectors ${\displaystyle \left\{{\textbf {d}}_{0},{\textbf {d}}_{1},...,{\textbf {d}}_{n-1}\right\},{\textbf {d}}_{i}\in {\textbf {R}}^{n}}$ are said to be mutually orthogonal (conjugate) with respect to ${\displaystyle {\textbf {A}}}$ if ${\displaystyle {\textbf {d}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{j}=0,\forall i\neq j}$. Note that if ${\displaystyle {\textbf {A}}=0}$, any two vectors are trivially conjugate to each other, and if ${\displaystyle {\textbf {A}}={\textbf {I}}}$, conjugacy reduces to the conventional notion of orthogonality. If ${\displaystyle \left\{{\textbf {d}}_{0},{\textbf {d}}_{1},...,{\textbf {d}}_{n-1}\right\}}$ are ${\displaystyle {\textbf {A}}}$-conjugate to each other, then they are linearly independent.

To illustrate the idea of conjugacy, the coordinate axes can be chosen as search directions (Figure 2). The first step (horizontal) reaches the correct ${\displaystyle x}$ coordinate of the solution ${\displaystyle {\textbf {x}}^{*}}$. The second step (vertical) reaches the correct ${\displaystyle y}$ coordinate of ${\displaystyle {\textbf {x}}^{*}}$, completing the search. We define the error ${\displaystyle {\textbf {e}}_{i}={\textbf {x}}^{*}-{\textbf {x}}_{i}}$ as a vector that indicates how far ${\displaystyle {\textbf {x}}_{i}}$ is from the solution, and the residual ${\displaystyle {\textbf {g}}_{i}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{i}}$ as a vector that indicates how far ${\displaystyle {\textbf {Ax}}_{i}}$ is from the correct value ${\displaystyle {\textbf {b}}}$. Notice that ${\displaystyle {\textbf {e}}_{1}}$ is orthogonal to ${\displaystyle {\textbf {d}}_{0}}$. In general, at each step we choose a point

 ${\displaystyle {\textbf {x}}_{i+1}={\textbf {x}}_{i}+\alpha _{i}{\textbf {d}}_{i}}$.


To find the value of ${\displaystyle \alpha _{i}}$, use the fact that ${\displaystyle {\textbf {e}}_{i+1}}$ should be orthogonal to ${\displaystyle {\textbf {d}}_{i}}$, so that there is no need to step in the direction of ${\displaystyle {\textbf {d}}_{i}}$ again later in the search. Using this condition and ${\displaystyle {\textbf {e}}_{i+1}={\textbf {e}}_{i}-\alpha _{i}{\textbf {d}}_{i}}$, we have

 ${\displaystyle {\textbf {d}}_{i}^{T}{\textbf {e}}_{i+1}=0}$

 ${\displaystyle {\textbf {d}}_{i}^{T}({\textbf {e}}_{i}-\alpha _{i}{\textbf {d}}_{i})=0}$

 ${\displaystyle \alpha _{i}={\frac {{\textbf {d}}_{i}^{T}{\textbf {e}}_{i}}{{\textbf {d}}_{i}^{T}{\textbf {d}}_{i}}}}$.


The motivation of A-conjugacy[4]

Notice from the above equation that the error ${\displaystyle {\textbf {e}}_{i}}$ must be known in order to compute ${\displaystyle \alpha _{i}}$. But if ${\displaystyle {\textbf {e}}_{i}}$ were known, the solution ${\displaystyle {\textbf {x}}^{*}}$ would already be known, so this is circular. To compute ${\displaystyle \alpha _{i}}$ without knowing ${\displaystyle {\textbf {e}}_{i}}$, let ${\displaystyle D=\left\{{\textbf {d}}_{0},{\textbf {d}}_{1},...,{\textbf {d}}_{n-1}\right\}}$ be a set of ${\displaystyle n}$ ${\displaystyle {\textbf {A}}}$-conjugate vectors. Since ${\displaystyle D}$ forms a basis, the solution ${\displaystyle {\textbf {x}}^{\ast }}$ to ${\displaystyle \nabla F({\textbf {x}})={\textbf {Ax}}-{\textbf {b}}={\textbf {0}}}$ can be expressed as:

 ${\displaystyle {\textbf {x}}^{\ast }=\sum _{i=0}^{n-1}\alpha _{i}{\textbf {d}}_{i}}$


Applying the transformation ${\displaystyle {\textbf {A}}}$ to both sides gives

 ${\displaystyle {\textbf {A}}{\textbf {x}}^{\ast }=\sum _{i=0}^{n-1}\alpha _{i}{\textbf {A}}{\textbf {d}}_{i}}$


Then, multiplying both sides by ${\displaystyle {\textbf {d}}_{k}^{T}}$ gives

 ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {x}}^{*}=\sum _{i=0}^{n-1}\alpha _{i}{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{i}}$


Because ${\displaystyle {\textbf {A}}{\textbf {x}}^{*}={\textbf {b}}}$ and, by the ${\displaystyle {\textbf {A}}}$-conjugacy of ${\displaystyle D}$, ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{i}=0,\forall k\neq i}$, the multiplication cancels out all terms except term ${\displaystyle k}$:

 ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {b}}=\alpha _{k}{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}$


So ${\displaystyle \alpha _{k}}$ can be obtained as follows:

 ${\displaystyle \alpha _{k}={\frac {{\textbf {d}}_{k}^{T}{\textbf {b}}}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}}$


Then the solution ${\displaystyle {\textbf {x}}^{*}}$ will be

 ${\displaystyle {\textbf {x}}^{*}=\sum _{i=0}^{n-1}\alpha _{i}{\textbf {d}}_{i}=\sum _{i=0}^{n-1}{\frac {{\textbf {d}}_{i}^{T}{\textbf {b}}}{{\textbf {d}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{i}}}{\textbf {d}}_{i}}$.


It takes ${\displaystyle n}$ steps to reach ${\displaystyle {\textbf {x}}^{*}}$. Additionally, because ${\displaystyle {\textbf {A}}}$ is a symmetric positive-definite matrix, the term ${\displaystyle {\textbf {d}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{i}}$ defines an inner product, so the denominators are nonzero and there is no need to compute the inverse of the matrix ${\displaystyle {\textbf {A}}}$.
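As a sanity check of this closed-form expansion, one can take the eigenvectors of ${\displaystyle {\textbf {A}}}$ as the conjugate set (the eigenvectors of a symmetric matrix are orthogonal, hence A-conjugate, since ${\displaystyle {\textbf {d}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{j}=\lambda _{j}{\textbf {d}}_{i}^{T}{\textbf {d}}_{j}=0}$ for ${\displaystyle i\neq j}$) and verify that the weighted sum recovers the solution. A minimal NumPy sketch with an illustrative 2×2 system (the matrix and right-hand side are chosen only for demonstration):

```python
import numpy as np

# A small SPD system Ax = b (values chosen only for illustration).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Eigenvectors of a symmetric matrix are orthogonal, hence A-conjugate:
# d_i^T A d_j = lambda_j * d_i^T d_j = 0 for i != j.
_, vecs = np.linalg.eigh(A)
directions = [vecs[:, i] for i in range(A.shape[0])]

# x* = sum_i (d_i^T b / d_i^T A d_i) d_i
x = sum((d @ b) / (d @ A @ d) * d for d in directions)

print(np.allclose(A @ x, b))  # True
```

Note that only inner products with ${\displaystyle {\textbf {A}}}$ appear; the inverse of ${\displaystyle {\textbf {A}}}$ is never formed.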

Conjugate Direction Theorem

To demonstrate that the above procedure indeed computes ${\displaystyle {\textbf {x}}^{*}}$ in ${\displaystyle n}$ steps, we introduce the conjugate direction theorem. As above, let ${\displaystyle D=\left\{{\textbf {d}}_{0},{\textbf {d}}_{1},...,{\textbf {d}}_{n-1}\right\}}$ be a set of ${\displaystyle n}$ ${\displaystyle {\textbf {A}}}$-conjugate vectors and let ${\displaystyle {\textbf {x}}_{0}\in {\textbf {R}}^{n}}$ be an arbitrary starting point. The theorem states that, under the iteration

 ${\displaystyle {\textbf {x}}_{k+1}={\textbf {x}}_{k}+\alpha _{k}{\textbf {d}}_{k}}$

 ${\displaystyle {\textbf {g}}_{k}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{k}}$

 ${\displaystyle \alpha _{k}={\frac {{\textbf {g}}_{k}^{T}{\textbf {d}}_{k}}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}={\frac {({\textbf {b}}-{\textbf {A}}{\textbf {x}}_{k})^{T}{\textbf {d}}_{k}}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}}$


After ${\displaystyle n}$ steps, ${\displaystyle {\textbf {x}}_{n}={\textbf {x}}^{*}}$.

Proof:
The error ${\displaystyle {\textbf {e}}_{0}}$ from the starting point to the solution can be expanded in the basis ${\displaystyle D}$, with some coefficients ${\displaystyle \alpha _{i}}$, as follows:

 ${\displaystyle {\textbf {x}}^{*}-{\textbf {x}}_{0}=\alpha _{0}{\textbf {d}}_{0}+\alpha _{1}{\textbf {d}}_{1}+...+\alpha _{n-1}{\textbf {d}}_{n-1}}$


And the steps traversed from the starting point to the point ${\displaystyle {\textbf {x}}_{k}}$ can be expressed as follows:

 ${\displaystyle {\textbf {x}}_{k}-{\textbf {x}}_{0}=\alpha _{0}{\textbf {d}}_{0}+\alpha _{1}{\textbf {d}}_{1}+...+\alpha _{k-1}{\textbf {d}}_{k-1}}$


The residual at the point ${\displaystyle {\textbf {x}}_{k}}$ can be expressed as follows:

 ${\displaystyle {\textbf {g}}_{k}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{k}={\textbf {A}}{\textbf {x}}^{*}-{\textbf {A}}{\textbf {x}}_{k}={\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{k})}$


Multiplying the expansion of ${\displaystyle {\textbf {x}}^{*}-{\textbf {x}}_{0}}$ by ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {A}}}$ and using conjugacy, ${\displaystyle \alpha _{k}}$ can be isolated as follows:

 ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{0})={\textbf {d}}_{k}^{T}{\textbf {A}}(\alpha _{0}{\textbf {d}}_{0}+\alpha _{1}{\textbf {d}}_{1}+...+\alpha _{n-1}{\textbf {d}}_{n-1})=\alpha _{k}{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}$

 ${\displaystyle \alpha _{k}={\frac {{\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{0})}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}}$


But, still, the solution ${\displaystyle {\textbf {x}}^{*}}$ would have to be known to calculate ${\displaystyle \alpha _{k}}$ this way. The following manipulations remove ${\displaystyle {\textbf {x}}^{*}}$ from the above expression:

 ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}_{k}-{\textbf {x}}_{0})={\textbf {d}}_{k}^{T}{\textbf {A}}(\alpha _{0}{\textbf {d}}_{0}+\alpha _{1}{\textbf {d}}_{1}+...+\alpha _{k-1}{\textbf {d}}_{k-1})=0}$

 ${\displaystyle {\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{0})={\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{k}+{\textbf {x}}_{k}-{\textbf {x}}_{0})={\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{k})}$


Therefore

 ${\displaystyle \alpha _{k}={\frac {{\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{0})}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}={\frac {{\textbf {d}}_{k}^{T}{\textbf {A}}({\textbf {x}}^{*}-{\textbf {x}}_{k})}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}={\frac {{\textbf {d}}_{k}^{T}{\textbf {g}}_{k}}{{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}}}$

which is exactly the step size used in the theorem. Hence the expansion coefficients coincide with the steps taken, and after ${\displaystyle n}$ steps ${\displaystyle {\textbf {x}}_{n}={\textbf {x}}_{0}+\sum _{k=0}^{n-1}\alpha _{k}{\textbf {d}}_{k}={\textbf {x}}^{*}}$.


The conjugate gradient method is a conjugate direction method in which the successive direction vectors are obtained as a conjugate version of the successive gradients computed as the method progresses. The conjugate directions are not specified beforehand but are determined sequentially at each step of the iteration [4]. If the conjugate vectors are chosen carefully, not all of them may be needed to obtain the solution; for this reason, the conjugate gradient method is regarded as an iterative method. This also allows approximate solutions to systems where ${\displaystyle n}$ is so large that a direct method would require too much time [2].

Derivation of Algorithm[2]

Let ${\displaystyle D=\left\{{\textbf {d}}_{1},{\textbf {d}}_{2},...,{\textbf {d}}_{n}\right\}}$ be a set of ${\displaystyle n}$ ${\displaystyle {\textbf {A}}}$-conjugate vectors. Then ${\displaystyle F({\textbf {x}}_{0}+\alpha _{1}{\textbf {d}}_{1}+\alpha _{2}{\textbf {d}}_{2}+...+\alpha _{n}{\textbf {d}}_{n})}$ can be minimized by stepping from ${\displaystyle {\textbf {x}}_{0}}$ along ${\displaystyle {\textbf {d}}_{1}}$ to the minimum ${\displaystyle {\textbf {x}}_{1}}$, then from ${\displaystyle {\textbf {x}}_{1}}$ along ${\displaystyle {\textbf {d}}_{2}}$ to the minimum ${\displaystyle {\textbf {x}}_{2}}$, and so on. Let ${\displaystyle {\textbf {x}}_{0}\in {\textbf {R}}^{n}}$ be arbitrarily chosen; the algorithm is then as follows:

 ${\displaystyle {\underline {Alg\;1:\;Pick\;\left\{{\textbf {d}}_{1},{\textbf {d}}_{2},...,{\textbf {d}}_{n}\right\}\;mutually\;{\textbf {A}}-conjugate,\;and\;start\;from\;a\;random\;{\textbf {x}}_{0}}}}$

 Set a starting point ${\displaystyle {\textbf {x}}_{0}}$
 Set ${\displaystyle k=1}$
 Set the maximum iteration ${\displaystyle n}$
 While ${\displaystyle k\leq n}$:
  ${\displaystyle \alpha _{k}={\textbf {d}}_{k}^{T}({\textbf {b}}-{\textbf {A}}{\textbf {x}}_{k-1})/{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}$
  ${\displaystyle {\textbf {x}}_{k}={\textbf {x}}_{k-1}+\alpha _{k}{\textbf {d}}_{k}}$
  ${\displaystyle k=k+1}$
 End while
 Return ${\displaystyle {\textbf {x}}_{n}}$
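Alg 1 assumes a mutually A-conjugate set is already available. One simple (illustrative) way to obtain such a set is Gram-Schmidt A-orthogonalization of the standard basis vectors; the following minimal NumPy sketch of Alg 1 uses that construction and the 2×2 system from the numerical example later in this article:

```python
import numpy as np

def a_conjugate_basis(A):
    """Build mutually A-conjugate directions by Gram-Schmidt
    A-orthogonalization of the standard basis vectors."""
    dirs = []
    for e in np.eye(A.shape[0]):
        # Subtract the A-projections onto all previous directions.
        d = e - sum((e @ A @ p) / (p @ A @ p) * p for p in dirs)
        dirs.append(d)
    return dirs

def conjugate_directions_solve(A, b, x0):
    """Alg 1: take one exact step along each A-conjugate direction."""
    x = x0.astype(float)
    for d in a_conjugate_basis(A):
        alpha = d @ (b - A @ x) / (d @ A @ d)
        x = x + alpha * d
    return x

A = np.array([[5.0, 1.0], [1.0, 8.0]])
b = np.array([3.0, 2.0])
x = conjugate_directions_solve(A, b, np.array([2.0, 1.0]))
print(np.allclose(A @ x, b))  # True
```

Any A-conjugate set works here; the CG method described next simply builds a particular set on the fly from the residuals.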

The CG method is Alg 1 with a particular choice of ${\displaystyle \left\{{\textbf {d}}_{1},{\textbf {d}}_{2},...,{\textbf {d}}_{n}\right\}}$. Let ${\displaystyle {\textbf {g}}_{k}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{k}}$ be the residual (equivalent to the negative of the gradient) at ${\displaystyle {\textbf {x}}_{k}}$. The next search direction is built out of the current residual and all previous search directions: the CG method picks ${\displaystyle {\textbf {d}}_{k+1}}$ as the component of ${\displaystyle {\textbf {g}}_{k}}$ that is ${\displaystyle {\textbf {A}}}$-conjugate to ${\displaystyle \left\{{\textbf {d}}_{1},{\textbf {d}}_{2},...,{\textbf {d}}_{k}\right\}}$:

 ${\displaystyle {\textbf {d}}_{k+1}={\textbf {g}}_{k}-\sum _{i=1}^{k}{\frac {{\textbf {g}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{i}}{{\textbf {d}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{i}}}{\textbf {d}}_{i}}$


Since ${\displaystyle {\textbf {g}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{i}=0}$ for ${\displaystyle i=1,...,k-1}$, only the last term of the sum survives, giving the following CG algorithm:

 ${\displaystyle {\underline {Alg\;2:\;Revised\;algorithm\;from\;Alg\;1\;without\;a\;particular\;choice\;of\;D}}}$

 Set a starting point ${\displaystyle {\textbf {x}}_{0}}$
 Set ${\displaystyle k=1}$
 Set the maximum iteration ${\displaystyle n}$
 While ${\displaystyle k\leq n}$:
  ${\displaystyle {\textbf {g}}_{k-1}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{k-1}}$
  If ${\displaystyle {\textbf {g}}_{k-1}=0}$: return ${\displaystyle {\textbf {x}}_{k-1}}$
  If ${\displaystyle k>1}$: ${\displaystyle \beta _{k}={\textbf {d}}_{k-1}^{T}{\textbf {A}}{\textbf {g}}_{k-1}/{{\textbf {d}}_{k-1}^{T}{\textbf {A}}{\textbf {d}}_{k-1}}}$
  If ${\displaystyle k=1}$: ${\displaystyle {\textbf {d}}_{k}={\textbf {g}}_{0}}$, else ${\displaystyle {\textbf {d}}_{k}={\textbf {g}}_{k-1}-\beta _{k}{\textbf {d}}_{k-1}}$
  ${\displaystyle \alpha _{k}={\textbf {d}}_{k}^{T}{\textbf {g}}_{k-1}/{\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k}}$
  ${\displaystyle {\textbf {x}}_{k}={\textbf {x}}_{k-1}+\alpha _{k}{\textbf {d}}_{k}}$
  ${\displaystyle k=k+1}$
 End while
 Return ${\displaystyle {\textbf {x}}_{n}}$

The formulas in Alg 2 can be simplified as follows:

 ${\displaystyle {\textbf {x}}_{i}={\textbf {x}}_{i-1}+\alpha _{i}{\textbf {d}}_{i}}$

${\displaystyle {\textbf {b}}-{\textbf {A}}{\textbf {x}}_{i}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{i-1}-\alpha _{i}{\textbf {A}}{\textbf {d}}_{i}}$

${\displaystyle {\textbf {g}}_{i}={\textbf {g}}_{i-1}-\alpha _{i}{\textbf {A}}{\textbf {d}}_{i}}$


Then ${\displaystyle \beta _{i}}$ and ${\displaystyle \alpha _{i}}$ can be simplified by taking the inner product of the above residual recurrence with ${\displaystyle {\textbf {g}}_{i}}$ and ${\displaystyle {\textbf {g}}_{i-1}}$, and using the mutual orthogonality of the residuals, ${\displaystyle {\textbf {g}}_{i}^{T}{\textbf {g}}_{i-1}=0}$:

 ${\displaystyle {\textbf {g}}_{i}^{T}{\textbf {g}}_{i}=-\alpha _{i}{\textbf {g}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{i}}$

${\displaystyle {\textbf {g}}_{i-1}^{T}{\textbf {g}}_{i-1}=\alpha _{i}{\textbf {g}}_{i-1}^{T}{\textbf {A}}{\textbf {d}}_{i}}$


Since ${\displaystyle {\textbf {d}}_{i}={\textbf {g}}_{i-1}-\beta _{i}{\textbf {d}}_{i-1}}$, i.e. ${\displaystyle {\textbf {g}}_{i-1}={\textbf {d}}_{i}+\beta _{i}{\textbf {d}}_{i-1}}$, and by conjugacy ${\displaystyle {\textbf {d}}_{i-1}^{T}{\textbf {A}}{\textbf {d}}_{i}=0}$, we have

 ${\displaystyle {\textbf {g}}_{i-1}^{T}{\textbf {g}}_{i-1}=\alpha _{i}{\textbf {g}}_{i-1}^{T}{\textbf {A}}{\textbf {d}}_{i}=\alpha _{i}{\textbf {d}}_{i}^{T}{\textbf {A}}{\textbf {d}}_{i}}$


Therefore

 ${\displaystyle \beta _{i+1}=-{\frac {{\textbf {g}}_{i}^{T}{\textbf {g}}_{i}}{{\textbf {g}}_{i-1}^{T}{\textbf {g}}_{i-1}}}}$


This gives the following simplified version of Alg 2:

 ${\displaystyle {\underline {Alg\;3:\;Simplified\;version\;of\;Alg\;2}}}$

 Set a starting point ${\displaystyle {\textbf {x}}_{0}}$
 Set ${\displaystyle {\textbf {g}}_{0}={\textbf {b}}-{\textbf {Ax}}_{0}}$
 Set ${\displaystyle k=1}$
 Set the maximum iteration ${\displaystyle n}$
 While ${\displaystyle k\leq n}$:
  If ${\displaystyle {\textbf {g}}_{k-1}=0}$: return ${\displaystyle {\textbf {x}}_{k-1}}$
  If ${\displaystyle k>1}$: ${\displaystyle \beta _{k}=-({\textbf {g}}_{k-1}^{T}{\textbf {g}}_{k-1})/({\textbf {g}}_{k-2}^{T}{\textbf {g}}_{k-2})}$
  If ${\displaystyle k=1}$: ${\displaystyle {\textbf {d}}_{k}={\textbf {g}}_{0}}$, else ${\displaystyle {\textbf {d}}_{k}={\textbf {g}}_{k-1}-\beta _{k}{\textbf {d}}_{k-1}}$
  ${\displaystyle \alpha _{k}=({\textbf {g}}_{k-1}^{T}{\textbf {g}}_{k-1})/({\textbf {d}}_{k}^{T}{\textbf {A}}{\textbf {d}}_{k})}$
  ${\displaystyle {\textbf {x}}_{k}={\textbf {x}}_{k-1}+\alpha _{k}{\textbf {d}}_{k}}$
  ${\displaystyle {\textbf {g}}_{k}={\textbf {g}}_{k-1}-\alpha _{k}{\textbf {A}}{\textbf {d}}_{k}}$
  ${\displaystyle k=k+1}$
 End while
 Return ${\displaystyle {\textbf {x}}_{n}}$
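Alg 3 can be transcribed almost directly into Python. The following minimal NumPy sketch follows the sign convention used above (${\displaystyle \beta _{k}}$ is negative and ${\displaystyle {\textbf {d}}_{k}={\textbf {g}}_{k-1}-\beta _{k}{\textbf {d}}_{k-1}}$) and is applied to the 2×2 system from the numerical example that follows; the tolerance value is an illustrative assumption:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Alg 3: conjugate gradient for SPD A. Note beta here is
    negative, matching d_k = g_{k-1} - beta_k * d_{k-1} above."""
    x = x0.astype(float)
    g = b - A @ x                      # residual g_0
    d = g.copy()                       # first direction d_1 = g_0
    for _ in range(len(b)):            # at most n iterations
        if np.linalg.norm(g) <= tol:   # residual vanished: done
            break
        Ad = A @ d
        alpha = (g @ g) / (d @ Ad)
        x = x + alpha * d
        g_new = g - alpha * Ad         # recurrence g_k = g_{k-1} - alpha_k A d_k
        beta = -(g_new @ g_new) / (g @ g)
        d = g_new - beta * d
        g = g_new
    return x

A = np.array([[5.0, 1.0], [1.0, 8.0]])
b = np.array([3.0, 2.0])
x = conjugate_gradient(A, b, np.array([2.0, 1.0]))
print(np.allclose(x, [22/39, 7/39]))  # True
```

Note that each iteration uses only one matrix-vector product `A @ d`, which is what makes the method attractive for large, sparse systems.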

Numerical example

Consider the linear system ${\displaystyle {\textbf {A}}{\textbf {x}}={\textbf {b}}}$

 ${\displaystyle {\textbf {A}}{\textbf {x}}={\begin{bmatrix}5&1\\1&8\\\end{bmatrix}}{\begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}}={\begin{bmatrix}3\\2\end{bmatrix}}={\textbf {b}}}$.


The initial starting point is set to be

 ${\displaystyle {\textbf {x}}_{0}={\begin{bmatrix}2\\1\end{bmatrix}}}$.


Implement the conjugate gradient method to approximate the solution to the system.

Solution:
The exact solution is given below for later reference:

 ${\displaystyle {\textbf {x}}^{*}={\begin{bmatrix}22/39\\7/39\end{bmatrix}}\approx {\begin{bmatrix}0.5641\\0.1794\end{bmatrix}}}$.


Step 1: since this is the first iteration, use the residual vector ${\displaystyle {\textbf {g}}_{0}}$ as the initial search direction ${\displaystyle {\textbf {d}}_{1}}$. The method of selecting ${\displaystyle {\textbf {d}}_{k}}$ will change in further iterations.

 ${\displaystyle {\textbf {g}}_{0}={\textbf {b}}-{\textbf {A}}{\textbf {x}}_{0}={\begin{bmatrix}3\\2\end{bmatrix}}-{\begin{bmatrix}5&1\\1&8\\\end{bmatrix}}{\begin{bmatrix}2\\1\end{bmatrix}}={\begin{bmatrix}-8\\-8\end{bmatrix}}={\textbf {d}}_{1}}$


Step 2: the scalar ${\displaystyle \alpha _{1}}$ can be calculated using the relationship

 ${\displaystyle \alpha _{1}={\frac {{\textbf {g}}_{0}^{T}{\textbf {g}}_{0}}{{\textbf {d}}_{1}^{T}{\textbf {A}}{\textbf {d}}_{1}}}={\frac {{\begin{bmatrix}-8&-8\\\end{bmatrix}}{\begin{bmatrix}-8\\-8\end{bmatrix}}}{{\begin{bmatrix}-8&-8\\\end{bmatrix}}{\begin{bmatrix}5&1\\1&8\\\end{bmatrix}}{\begin{bmatrix}-8\\-8\end{bmatrix}}}}={\frac {2}{15}}}$


Step 3: ${\displaystyle {\textbf {x}}_{1}}$ can then be reached by the following formula. This step finishes the first iteration, yielding an "improved" approximate solution to the system, ${\displaystyle {\textbf {x}}_{1}}$:

 ${\displaystyle {\textbf {x}}_{1}={\textbf {x}}_{0}+\alpha _{1}{\textbf {d}}_{1}={\begin{bmatrix}2\\1\end{bmatrix}}+{\frac {2}{15}}{\begin{bmatrix}-8\\-8\end{bmatrix}}={\begin{bmatrix}0.9333\\-0.0667\end{bmatrix}}}$


Step 4: now move on and compute the next residual vector ${\displaystyle {\textbf {g}}_{1}}$ using the formula

 ${\displaystyle {\textbf {g}}_{1}={\textbf {g}}_{0}-\alpha _{1}{\textbf {A}}{\textbf {d}}_{1}={\begin{bmatrix}-8\\-8\end{bmatrix}}-{\frac {2}{15}}{\begin{bmatrix}5&1\\1&8\end{bmatrix}}{\begin{bmatrix}-8\\-8\end{bmatrix}}={\begin{bmatrix}-1.6\\1.6\end{bmatrix}}}$


Step 5: compute the scalar ${\displaystyle \beta _{2}}$ that will eventually be used to determine the next search direction ${\displaystyle {\textbf {d}}_{2}}$.

 ${\displaystyle \beta _{2}=-{\frac {{\textbf {g}}_{1}^{T}{\textbf {g}}_{1}}{{\textbf {g}}_{0}^{T}{\textbf {g}}_{0}}}=-{\frac {{\begin{bmatrix}-1.6&1.6\\\end{bmatrix}}{\begin{bmatrix}-1.6\\1.6\end{bmatrix}}}{{\begin{bmatrix}-8&-8\\\end{bmatrix}}{\begin{bmatrix}-8\\-8\end{bmatrix}}}}=-0.04}$


Step 6: use this scalar ${\displaystyle \beta _{2}}$ to compute the next search direction ${\displaystyle {\textbf {d}}_{2}}$:

 ${\displaystyle {\textbf {d}}_{2}={\textbf {g}}_{1}-\beta _{2}{\textbf {d}}_{1}={\begin{bmatrix}-1.6\\1.6\end{bmatrix}}+0.04{\begin{bmatrix}-8\\-8\end{bmatrix}}={\begin{bmatrix}-1.92\\1.28\end{bmatrix}}}$


Step 7: now compute the scalar ${\displaystyle \alpha _{2}}$ using the newly acquired ${\displaystyle {\textbf {d}}_{2}}$, by the same method as that used for ${\displaystyle \alpha _{1}}$.

 ${\displaystyle \alpha _{2}={\frac {{\textbf {g}}_{1}^{T}{\textbf {g}}_{1}}{{\textbf {d}}_{2}^{T}{\textbf {A}}{\textbf {d}}_{2}}}={\frac {{\begin{bmatrix}-1.6&1.6\end{bmatrix}}{\begin{bmatrix}-1.6\\1.6\end{bmatrix}}}{{\begin{bmatrix}-1.92&1.28\\\end{bmatrix}}{\begin{bmatrix}5&1\\1&8\\\end{bmatrix}}{\begin{bmatrix}-1.92\\1.28\end{bmatrix}}}}=0.1923}$


Step 8: finally, find ${\displaystyle {\textbf {x}}_{2}}$ using the same method as that used to find ${\displaystyle {\textbf {x}}_{1}}$.

 ${\displaystyle {\textbf {x}}_{2}={\textbf {x}}_{1}+\alpha _{2}{\textbf {d}}_{2}={\begin{bmatrix}0.9333\\-0.0667\end{bmatrix}}+0.1923{\begin{bmatrix}-1.92\\1.28\end{bmatrix}}={\begin{bmatrix}0.5641\\0.1794\end{bmatrix}}}$


Therefore, ${\displaystyle {\textbf {x}}_{2}}$ matches the exact solution ${\displaystyle {\textbf {x}}^{*}}$: as expected, the method converged in ${\displaystyle n=2}$ iterations for this ${\displaystyle 2\times 2}$ system.
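The two iterations above can be reproduced numerically; a minimal NumPy transcription of the worked example (variable names mirror the symbols in the steps):

```python
import numpy as np

A = np.array([[5.0, 1.0], [1.0, 8.0]])
b = np.array([3.0, 2.0])
x0 = np.array([2.0, 1.0])

# Iteration 1 (Steps 1-3)
g0 = b - A @ x0                     # residual: [-8, -8]
d1 = g0                             # first search direction
a1 = (g0 @ g0) / (d1 @ A @ d1)      # alpha_1 = 2/15
x1 = x0 + a1 * d1

# Iteration 2 (Steps 4-8)
g1 = g0 - a1 * (A @ d1)             # residual: [-1.6, 1.6]
b2 = -(g1 @ g1) / (g0 @ g0)         # beta_2 = -0.04
d2 = g1 - b2 * d1                   # [-1.92, 1.28]
a2 = (g1 @ g1) / (d2 @ A @ d2)      # alpha_2 ~= 0.1923
x2 = x1 + a2 * d2

print(np.allclose(x2, [22/39, 7/39]))  # True
```

Up to floating-point round-off, the result agrees with the exact solution computed earlier.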

Application

Conjugate gradient methods have been used to solve a wide variety of numerical problems, including linear and nonlinear algebraic equations, eigenvalue problems, and minimization problems. These applications are similar in that they involve large numbers of variables or dimensions. In such circumstances, any solution method that requires storing a full matrix of this large order becomes inapplicable, so recourse to the conjugate gradient method may be the only alternative [5]. Here we demonstrate three applications of the CG method.

Iterative image reconstruction

The conjugate gradient method is used to solve for the update in iterative image reconstruction problems. For example, in the magnetic resonance imaging (MRI) contrast known as quantitative susceptibility mapping (QSM), the reconstructed image ${\displaystyle \chi }$ is iteratively solved for from magnetic field data ${\displaystyle {\textbf {b}}}$ by the relation[6]

 ${\displaystyle {\textbf {b}}={\textbf {D}}\chi }$


where ${\displaystyle {\textbf {D}}}$ is the matrix expressing convolution with the dipole kernel in the Fourier domain. Given that the problem is ill-posed, a physical prior is used in the reconstruction, which is framed as a constrained L1-norm minimization problem

 ${\displaystyle \min _{\chi }\left\|f(\chi )\right\|_{1}}$

 ${\displaystyle s.t.\;\left\|g(\chi )-c\right\|_{2}^{2}\leq \varepsilon }$


A detailed treatment of the functions ${\displaystyle f(\chi )}$ and ${\displaystyle g(\chi )}$ can be found in [6]. This problem can be expressed as an unconstrained minimization problem via the Lagrange multiplier method

 ${\displaystyle \min _{\chi ,\lambda }E(\chi ,\lambda )}$


Where

 ${\displaystyle E(\chi ,\lambda )\equiv \left\|f(\chi )\right\|_{1}+\lambda (\left\|g(\chi )-c\right\|_{2}^{2}-\varepsilon )}$


The first-order conditions require ${\displaystyle \nabla _{\chi }E(\chi ,\lambda )=0}$ and ${\displaystyle \nabla _{\lambda }E(\chi ,\lambda )=0}$. These conditions result in ${\displaystyle \nabla _{\chi }f(\chi )+\nabla _{\chi }g(\chi )-{\tilde {c}}=0}$ and ${\displaystyle \left\|g(\chi )-c\right\|_{2}^{2}\approx \varepsilon }$, respectively. The update ${\displaystyle \nabla _{\chi }f(\chi )+\nabla _{\chi }g(\chi )-{\tilde {c}}=L(\chi )\chi -{\tilde {c}}=0}$ can then be solved via fixed-point iteration [6]:

 ${\displaystyle \chi _{n+1}=L^{-1}(\chi _{n}){\tilde {c}}}$


This can also be expressed as a quasi-Newton problem, which is more robust to round-off error [6]:

 ${\displaystyle \chi _{n+1}=\chi _{n}-L^{-1}(\chi _{n})\nabla E(\chi _{n},\lambda )}$


This is solved with the CG method until the relative update ${\displaystyle \left\|\chi _{n+1}-\chi _{n}\right\|_{2}/\left\|\chi _{n}\right\|_{2}\leq \theta }$, where ${\displaystyle \theta }$ is a specified tolerance, such as ${\displaystyle 10^{-2}}$.

Additional L1 terms, such as a downsampled term [7], can be added, in which case the cost function is treated with penalty methods and the CG method is paired with Gauss-Newton to address the nonlinear terms.

Facial recognition

Figure 3. The process of decomposing an image. DWT plays a significant role in reducing the dimension of an image and extracting its features by decomposing the image in the frequency domain into sub-bands at different scales. The DWT of an image is created as follows: in the first level of decomposition, the image is split into four sub-bands, namely HH1, HL1, LH1, and LL1, as shown in the figure. The HH1, HL1, and LH1 sub-bands represent the diagonal details, horizontal features, and vertical structures of the image, respectively. The LL1 sub-band is the low-resolution residual consisting of low-frequency components, and it is this sub-band that is further split at higher levels of decomposition.[8][9]

Face recognition can be realized by implementing conjugate gradient algorithms in combination with other methods. The basic steps are to decompose images into a set of time-frequency coefficients using the discrete wavelet transform (DWT) (Figure 3)[8] and then to train a neural network (NN) with basic back-propagation (BP). To overcome the slow convergence of BP with steepest gradient descent, conjugate gradient methods are introduced. Generally, there are four types of CG methods for training a feed-forward NN, namely Fletcher-Reeves CG, Polak-Ribière CG, Powell-Beale CG, and scaled CG. All of these CG methods follow the steps demonstrated in Alg 3, with their respective modifications[8][10].

Notice that ${\displaystyle \beta _{k}}$ is a scalar that differs among the versions of the CG method. For the Fletcher-Reeves CG, it is defined as follows (as in Alg 3):

 ${\displaystyle \beta _{i+1}=-{\frac {{\textbf {g}}_{i}^{T}{\textbf {g}}_{i}}{{\textbf {g}}_{i-1}^{T}{\textbf {g}}_{i-1}}}}$


where ${\displaystyle {\textbf {g}}_{i-1}}$ and ${\displaystyle {\textbf {g}}_{i}}$ are the previous and current gradients, respectively. Fletcher and Reeves suggested restarting the CG after every n or n+1 iterations, a suggestion that can be very helpful in practice[10].

The second version of the CG method, namely the Polak-Ribière CG, was proposed in 1977[11]. In this algorithm, the scalar at each iteration is computed as[12]:

 ${\displaystyle \beta _{i+1}=-{\frac {\Delta {\textbf {g}}_{i-1}^{T}{\textbf {g}}_{i}}{{\textbf {g}}_{i-1}^{T}{\textbf {g}}_{i-1}}}}$


where ${\displaystyle \Delta {\textbf {g}}_{i-1}={\textbf {g}}_{i}-{\textbf {g}}_{i-1}}$ is the change in the gradient. Powell showed that the Polak-Ribière CG method performs better than the Fletcher-Reeves CG method in many numerical experiments[10].

The aforementioned Fletcher-Reeves and Polak-Ribière algorithms restart periodically by taking a steepest descent step every ${\displaystyle n}$ or ${\displaystyle n+1}$ iterations. A disadvantage of periodic restarting is that the immediate reduction in the objective function is usually smaller than it would be without a restart. Therefore, Powell proposed a new restart criterion based on an earlier version[10][12]: a restart occurs whenever little orthogonality remains between the current gradient and the previous gradient[10][12].

Another version of the CG method, the scaled CG, was proposed by Møller. This algorithm uses a step-size scaling mechanism that avoids a time-consuming line search at each learning iteration. Møller[13] showed that the scaled CG method has superlinear convergence for most problems.
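Since these variants differ mainly in the choice of ${\displaystyle \beta }$, they can share one skeleton. The following minimal sketch of nonlinear CG supports both the Fletcher-Reeves and Polak-Ribière formulas; it uses the standard textbook sign convention ${\displaystyle {\textbf {d}}_{k}=-{\textbf {g}}_{k}+\beta _{k}{\textbf {d}}_{k-1}}$ (rather than the negative-${\displaystyle \beta }$ convention above), a simple backtracking Armijo line search, and an illustrative convex quadratic test function. The line-search constants, iteration limits, and test problem are all assumptions for demonstration:

```python
import numpy as np

def nonlinear_cg(f, grad, x0, variant="FR", iters=200, tol=1e-8):
    """Nonlinear CG with Fletcher-Reeves or Polak-Ribiere beta,
    restarting to steepest descent every len(x0) iterations."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    n = len(x)
    for k in range(1, iters + 1):
        if np.linalg.norm(g) <= tol:
            break
        if g @ d >= 0:               # safeguard: ensure a descent direction
            d = -g
        t = 1.0                      # backtracking Armijo line search
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        x = x + t * d
        g_new = grad(x)
        if variant == "FR":          # Fletcher-Reeves
            beta = (g_new @ g_new) / (g @ g)
        else:                        # Polak-Ribiere (clipped at 0, a common safeguard)
            beta = max(0.0, (g_new @ (g_new - g)) / (g @ g))
        if k % n == 0:               # periodic restart to steepest descent
            beta = 0.0
        d = -g_new + beta * d
        g = g_new
    return x

# Illustrative test: a convex quadratic f(x) = x^T Q x / 2 - c^T x.
Q = np.array([[3.0, 0.5], [0.5, 2.0]])
c = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x - c @ x
grad = lambda x: Q @ x - c
x = nonlinear_cg(f, grad, np.zeros(2))
print(np.allclose(Q @ x, c))  # True
```

Swapping the `variant` argument changes only the ${\displaystyle \beta }$ formula, which is exactly how the NN-training variants above differ from one another.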

Conjugate gradient method in deep learning

Figure 4. Activity diagram for the training process with the conjugate gradient.[14]

In deep learning, the conjugate gradient method used for training can be regarded as something intermediate between gradient descent and Newton's method: it does not require storing, evaluating, or inverting the Hessian matrix, as Newton's method does. The optimization search is performed along conjugate directions, which generally produce faster convergence than gradient descent directions; the training directions are conjugate with respect to the Hessian matrix.

Let us denote by ${\displaystyle {\textbf {d}}}$ the training direction vector. Starting with an initial parameter vector ${\displaystyle {\textbf {w}}_{0}}$ and an initial training direction vector ${\displaystyle {\textbf {d}}_{0}=-{\textbf {g}}_{0}}$, the conjugate gradient method constructs a sequence of training directions as:

 ${\displaystyle {\textbf {d}}_{i+1}={\textbf {g}}_{i+1}+\gamma _{i}{\textbf {d}}_{i}\;\;\;\;\;\;\;for\;i=0,1,...}$


Here ${\displaystyle \gamma }$ is called the conjugate parameter. For all conjugate gradient algorithms, the training direction is periodically reset to the negative of the gradient. The parameters are then updated according to the following expression:

 ${\displaystyle {\textbf {w}}_{i+1}={\textbf {w}}_{i}+\eta _{i}{\textbf {d}}_{i}\;\;\;\;\;\;\;for\;i=0,1,...}$


The training rate ${\displaystyle \eta }$ is usually found by line minimization. Figure 4 depicts an activity diagram of the training process with the conjugate gradient. To improve the parameters, first compute the conjugate gradient training direction, and then search for a suitable training rate along that direction. This method has proved to be more effective than gradient descent for training neural networks, and it performs well even with very large networks, since it does not require the Hessian matrix.

Conclusion

The conjugate gradient method was invented to avoid the high computational cost of Newton's method and to accelerate the convergence rate of steepest descent. As an iterative method, each step requires only the matrix-vector product ${\displaystyle {\textbf {A}}{\textbf {d}}_{i}}$, so the matrix ${\displaystyle {\textbf {A}}}$ never needs to be stored explicitly or inverted. The selected direction vectors are a conjugate version of the successive gradients obtained as the method progresses, so the method monotonically improves the approximations ${\displaystyle {\textbf {x}}_{k}}$ to the exact solution and, in the absence of round-off error, may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. This makes it widely used for solving large, sparse problems. Because of the high flexibility of the framework, many variants of the CG algorithm have been proposed and applied in a variety of fields, such as machine learning and deep learning, to enhance performance.

References

1. Refsnæs, Runar Heggelien. “A Brief introduction to the conjugate gradient method.” (2009).