Conjugate gradient methods - Revision history

Agr78 at 17:15, 13 February 2024

2024-02-13T17:15:33Z

← Older revision		Revision as of 13:15, 13 February 2024
Line 10:		Line 10:
	The conjugate gradient method is often implemented as an iterative algorithm and can be considered as being between [https://en.wikipedia.org/wiki/Newton%27s_method Newton’s method], a second-order method that incorporates Hessian and gradient, and the method of steepest descent, a first-order method that uses gradient. Newton's Method usually reduces the number of iterations needed, but the calculation of the Hessian matrix and its inverse increases the computation required for each iteration. Steepest descent takes repeated steps in the opposite direction of the gradient of the function at the current point. It often takes steps in the same direction as earlier ones, resulting in slow convergence (Figure 1). To avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent, the conjugate gradient method was developed.<br><br>		The conjugate gradient method is often implemented as an iterative algorithm and can be considered as being between [https://en.wikipedia.org/wiki/Newton%27s_method Newton’s method], a second-order method that incorporates Hessian and gradient, and the method of steepest descent, a first-order method that uses gradient. Newton's Method usually reduces the number of iterations needed, but the calculation of the Hessian matrix and its inverse increases the computation required for each iteration. Steepest descent takes repeated steps in the opposite direction of the gradient of the function at the current point. It often takes steps in the same direction as earlier ones, resulting in slow convergence (Figure 1). To avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent, the conjugate gradient method was developed.<br><br>

	The idea of the CG method is to pick <math>n</math> orthogonal search directions first and, in each search direction, take exactly one step such that the step size is to the proposed solution <math>x</math> at that direction. The solution is reached after <math>n</math> steps <ref name = "foo1">W. Stuetzle, “The Conjugate Gradient Method.” 2001. [Online]. Available: https://sites.stat.washington.edu/wxs/Stat538-w03/conjugate-gradients.pdf		The idea of the CG method is to pick <math>n</math> orthogonal search directions first and, in each search direction, take exactly one step such that the step size is orthogonal to the proposed solution <math>x</math> at that direction. The solution is reached after <math>n</math> steps <ref name = "foo1">W. Stuetzle, “The Conjugate Gradient Method.” 2001. [Online]. Available: https://sites.stat.washington.edu/wxs/Stat538-w03/conjugate-gradients.pdf
	</ref> as, theoretically, the number of iterations needed by the CG method is equal to the number of different eigenvalues of <math>\textbf{A}</math>, i.e. at most <math>n</math>. This makes it attractive for large and sparse problems. The method can be used to solve least-squares problems and can also be generalized to a minimization method for general smooth functions <ref name = "foo1" />.		</ref> as, theoretically, the number of iterations needed by the CG method is equal to the number of different eigenvalues of <math>\textbf{A}</math>, i.e. at most <math>n</math>. This makes it attractive for large and sparse problems. The method can be used to solve least-squares problems and can also be generalized to a minimization method for general smooth functions <ref name = "foo1" />.
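The finite-termination property described above can be illustrated with a minimal pure-Python sketch of linear CG. The function name <code>conjugate_gradient</code>, the helper names, and the test matrix are assumptions made for this illustration and do not come from the cited reference. The matrix below is symmetric positive definite with only two distinct eigenvalues (2 and 5), so the iteration is expected to stop well before <math>n = 4</math> steps.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = [0.0] * n
    r = list(b)            # residual b - A x for the zero initial guess
    d = list(r)            # first search direction is the residual
    rs_old = dot(r, r)
    iterations = 0
    while iterations < max_iter and rs_old ** 0.5 > tol:
        Ad = matvec(A, d)
        alpha = rs_old / dot(d, Ad)                      # exact step along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        # New direction: residual plus a multiple of the previous direction.
        d = [ri + (rs_new / rs_old) * di for ri, di in zip(r, d)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


# Illustrative 4x4 system with eigenvalues 2, 2, 2, 5 (two distinct values).
A = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 5.0]]
b = [1.0, 1.0, 1.0, 1.0]
x, iterations = conjugate_gradient(A, b)
print(x, iterations)
```

For this example the solution is <math>(0.5, 0.5, 0.5, 0.2)</math>, and the loop needs two iterations, matching the number of distinct eigenvalues.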

MiScott1601: /* Conjugate gradient method in deep learning */

2021-12-12T01:16:41Z

Conjugate gradient method in deep learning

← Older revision		Revision as of 21:16, 11 December 2021
Line 310:		Line 310:

	===Conjugate gradient method in deep learning===		===Conjugate gradient method in deep learning===
			[[File:WX20211211-201145@2x.png|thumb|Figure 4. Activity diagram for the training process with the conjugate gradient.<ref>Quesada and Artelnics, “5 Algorithms to Train a Neural Network.” https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network</ref>]]

	The conjugate gradient method, as used for optimization in deep learning algorithms, can be regarded as intermediate between gradient descent and Newton's method: unlike Newton's method, it does not require storing, evaluating, and inverting the Hessian matrix. The search in conjugate gradient is performed along conjugate directions, which generally produce faster convergence than gradient descent directions. The training directions are conjugate with respect to the Hessian matrix.<br><br>		The conjugate gradient method, as used for optimization in deep learning algorithms, can be regarded as intermediate between gradient descent and Newton's method: unlike Newton's method, it does not require storing, evaluating, and inverting the Hessian matrix. The search in conjugate gradient is performed along conjugate directions, which generally produce faster convergence than gradient descent directions. The training directions are conjugate with respect to the Hessian matrix.<br><br>

Line 318:		Line 320:
	<math>\textbf{w}_{i+1} = \textbf{w}_i + \eta_i\textbf{d}_i\;\;\;\;\;\;\;for\;i=0,1,...</math>		<math>\textbf{w}_{i+1} = \textbf{w}_i + \eta_i\textbf{d}_i\;\;\;\;\;\;\;for\;i=0,1,...</math>

	The training rate <math>\eta</math> is usually found by line minimization. ~~The picture below~~ depicts an activity diagram for the training process with the conjugate gradient. To improve the parameters, we first compute the conjugate gradient training direction. Then, we search for a suitable training rate in that direction.		The training rate <math>\eta</math> is usually found by line minimization. Figure 4 depicts an activity diagram for the training process with the conjugate gradient. To improve the parameters, first compute the conjugate gradient training direction. Then, search for a suitable training rate in that direction. This method has proved to be more effective than gradient descent in training neural networks. Also, conjugate gradient performs well with vast neural networks since it does not require the Hessian matrix.

	== Conclusion ==		== Conclusion ==

MiScott1601: /* Conjugate gradient method in deep learning */

2021-12-12T01:07:38Z

Conjugate gradient method in deep learning

← Older revision		Revision as of 21:07, 11 December 2021
Line 310:		Line 310:

	===Conjugate gradient method in deep learning===		===Conjugate gradient method in deep learning===
			The conjugate gradient method, as used for optimization in deep learning algorithms, can be regarded as intermediate between gradient descent and Newton's method: unlike Newton's method, it does not require storing, evaluating, and inverting the Hessian matrix. The search in conjugate gradient is performed along conjugate directions, which generally produce faster convergence than gradient descent directions. The training directions are conjugate with respect to the Hessian matrix.<br><br>

			Let’s denote <math>\textbf{d}</math> as the training direction vector. Starting with an initial parameter vector <math>\textbf{w}_0</math> and an initial training direction vector <math>\textbf{d}_0 = -\textbf{g}_0</math>, where <math>\textbf{g}_i</math> denotes the gradient at <math>\textbf{w}_i</math>, the conjugate gradient method constructs a sequence of training directions as:
			<math>\textbf{d}_{i+1} = -\textbf{g}_{i+1} + \gamma_i\textbf{d}_i\;\;\;\;\;\;\;for\;i=0,1,...</math>

			Here <math>\gamma</math> is called the conjugate parameter. For all conjugate gradient algorithms, the training direction is periodically reset to the negative of the gradient. The parameters are then improved according to the following expression:
			<math>\textbf{w}_{i+1} = \textbf{w}_i + \eta_i\textbf{d}_i\;\;\;\;\;\;\;for\;i=0,1,...</math>

			The training rate <math>\eta</math> is usually found by line minimization. The picture below depicts an activity diagram for the training process with the conjugate gradient. To improve the parameters, we first compute the conjugate gradient training direction. Then, we search for a suitable training rate in that direction.
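The training loop described above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the quadratic <code>loss</code> stands in for a network's training loss, the conjugate parameter uses the Fletcher-Reeves formula (one of several possible choices), the line minimization is a golden-section search, and the direction is taken as the negative gradient plus <math>\gamma_i\textbf{d}_i</math>, consistent with <math>\textbf{d}_0 = -\textbf{g}_0</math>. All function names are hypothetical.

```python
import math


def loss(w):
    # Hypothetical smooth convex loss standing in for a training loss.
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2


def grad(w):
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def step(w, eta, d):
    return [wi + eta * di for wi, di in zip(w, d)]


def line_minimize(f, w, d, iters=80):
    """Golden-section search for the training rate eta >= 0 along d."""
    hi = 1.0
    # Grow the bracket while doubling the step still lowers the loss.
    for _ in range(60):
        if f(step(w, 2.0 * hi, d)) >= f(step(w, hi, d)):
            break
        hi *= 2.0
    lo, hi = 0.0, 2.0 * hi
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if f(step(w, a, d)) < f(step(w, b, d)):
            hi = b      # minimum lies in [lo, b]
        else:
            lo = a      # minimum lies in [a, hi]
    return 0.5 * (lo + hi)


def train_cg(w0, tol=1e-6, max_iter=100, reset_every=None):
    w = list(w0)
    reset_every = len(w) if reset_every is None else reset_every
    g = grad(w)
    d = [-gi for gi in g]                        # d_0 = -g_0
    for i in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        eta = line_minimize(loss, w, d)          # training rate by line minimization
        w = step(w, eta, d)                      # w_{i+1} = w_i + eta_i d_i
        g_new = grad(w)
        gamma = dot(g_new, g_new) / dot(g, g)    # Fletcher-Reeves conjugate parameter
        if (i + 1) % reset_every == 0:
            d = [-gi for gi in g_new]            # periodic reset to the negative gradient
        else:
            d = [-gi + gamma * di for gi, di in zip(g_new, d)]
            if dot(g_new, d) >= 0.0:             # safeguard: keep a descent direction
                d = [-gi for gi in g_new]
        g = g_new
    return w


w_final = train_cg([0.0, 0.0])
print(w_final, loss(w_final))
```

The minimizer of this illustrative loss is <math>(1, -2)</math>. Resetting every <math>n</math> iterations, with <math>n</math> the number of parameters, is a common convention and is assumed here only for simplicity.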

	== Conclusion ==		== Conclusion ==

Agr78: /* Iterative image reconstruction */

2021-12-11T15:28:14Z

Iterative image reconstruction

← Older revision		Revision as of 11:28, 11 December 2021
Line 285:		Line 285:
	Which is solved with the CG method until the residual <math>\left\|\chi_{n+1}-\chi\right\|_2/\left\|\chi_n\right\|_2\leq \theta </math> where <math>\theta</math> is a specified tolerance, such as <math>10^{-2}</math>.<br>		Which is solved with the CG method until the residual <math>\left\|\chi_{n+1}-\chi\right\|_2/\left\|\chi_n\right\|_2\leq \theta </math> where <math>\theta</math> is a specified tolerance, such as <math>10^{-2}</math>.<br>

	Additional L1 terms, such as a downsampled term <ref name = "foo44">A. Roberts, P. Spincemaille, T. Nguyen, Y. Wang, “International Society for Magnetic Resonance in Medicine,” in MEDI-d: Downsampled Morphological Priors for Shadow Reduction in Quantitative Susceptibility Mapping, 2021.</ref> can be added, in which case the cost function is treated with penalty methods and the CG method is ~~also used~~.		Additional L1 terms, such as a downsampled term <ref name = "foo44">A. Roberts, P. Spincemaille, T. Nguyen, Y. Wang, “International Society for Magnetic Resonance in Medicine,” in MEDI-d: Downsampled Morphological Priors for Shadow Reduction in Quantitative Susceptibility Mapping, 2021.</ref> can be added, in which case the cost function is treated with penalty methods and the CG method is paired with Gauss-Newton to address nonlinear terms.

	===Facial recognition===		===Facial recognition===

Agr78: /* Iterative image reconstruction */

2021-12-11T15:25:30Z

Iterative image reconstruction

← Older revision		Revision as of 11:25, 11 December 2021
Line 257:		Line 257:
	Conjugate gradient methods have often been used to solve a wide variety of numerical problems, including linear and nonlinear algebraic equations, eigenvalue problems and minimization problems. These applications have been similar in that they involve large numbers of variables or dimensions. In these circumstances any method of solution which involves storing a full matrix of this large order, becomes inapplicable. Thus recourse to the conjugate gradient method may be the only alternative <ref>R. Fletcher, “Conjugate gradient methods for indefinite systems,” in Numerical Analysis, Berlin, Heidelberg, 1976, pp. 73–89. doi: 10.1007/BFb0080116.</ref>. Here we demonstrate three application examples of CG method.		Conjugate gradient methods have often been used to solve a wide variety of numerical problems, including linear and nonlinear algebraic equations, eigenvalue problems and minimization problems. These applications have been similar in that they involve large numbers of variables or dimensions. In these circumstances any method of solution which involves storing a full matrix of this large order, becomes inapplicable. Thus recourse to the conjugate gradient method may be the only alternative <ref>R. Fletcher, “Conjugate gradient methods for indefinite systems,” in Numerical Analysis, Berlin, Heidelberg, 1976, pp. 73–89. doi: 10.1007/BFb0080116.</ref>. Here we demonstrate three application examples of CG method.
	===Iterative image reconstruction===		===Iterative image reconstruction===
	The conjugate gradient method is used to solve for the update in iterative image reconstruction problems. For example, in the magnetic resonance imaging (MRI) contrast known as quantitative susceptibility mapping (QSM), the reconstructed image <math>\chi</math> is iteratively solved for from magnetic field data <math>\textbf{b}</math> by the relation<ref name = "foo43">J. Liu et al., “Morphology enabled dipole inversion for quantitative susceptibility mapping using structural consistency between the magnitude image and the susceptibility map,” NeuroImage, vol. 59, no. 3, pp. 2560–2568, Feb. 2012, doi: 10.1016/j.neuroimage.2011.08.082.</ref><br>		The conjugate gradient method is used to solve for the update in iterative image reconstruction problems. For example, in the magnetic resonance imaging (MRI) contrast known as quantitative susceptibility mapping (QSM), the reconstructed image <math>\chi</math> is iteratively solved for from magnetic field data <math>\textbf{b}</math> by the relation<ref name = "foo43">T. Liu et al., “Morphology enabled dipole inversion for quantitative susceptibility mapping using structural consistency between the magnitude image and the susceptibility map,” NeuroImage, vol. 59, no. 3, pp. 2560–2568, Feb. 2012, doi: 10.1016/j.neuroimage.2011.08.082.</ref><br>
	<math>\textbf{b}=\textbf{D}\chi</math>		<math>\textbf{b}=\textbf{D}\chi</math>

Line 284:		Line 284:

	Which is solved with the CG method until the residual <math>\left\|\chi_{n+1}-\chi\right\|_2/\left\|\chi_n\right\|_2\leq \theta </math> where <math>\theta</math> is a specified tolerance, such as <math>10^{-2}</math>.<br>		Which is solved with the CG method until the residual <math>\left\|\chi_{n+1}-\chi\right\|_2/\left\|\chi_n\right\|_2\leq \theta </math> where <math>\theta</math> is a specified tolerance, such as <math>10^{-2}</math>.<br>

			Additional L1 terms, such as a downsampled term <ref name = "foo44">A. Roberts, P. Spincemaille, T. Nguyen, Y. Wang, “International Society for Magnetic Resonance in Medicine,” in MEDI-d: Downsampled Morphological Priors for Shadow Reduction in Quantitative Susceptibility Mapping, 2021.</ref> can be added, in which case the cost function is treated with penalty methods and the CG method is also used.

	===Facial recognition===		===Facial recognition===

MiScott1601: /* Facial recognition */

2021-12-11T08:32:34Z

Facial recognition

← Older revision		Revision as of 04:32, 11 December 2021
Line 285:		Line 285:
	Which is solved with the CG method until the residual <math>\left\|\chi_{n+1}-\chi\right\|_2/\left\|\chi_n\right\|_2\leq \theta </math> where <math>\theta</math> is a specified tolerance, such as <math>10^{-2}</math>.<br>		Which is solved with the CG method until the residual <math>\left\|\chi_{n+1}-\chi\right\|_2/\left\|\chi_n\right\|_2\leq \theta </math> where <math>\theta</math> is a specified tolerance, such as <math>10^{-2}</math>.<br>

	===Facial recognition<ref name = "FR">H. Azami, M. Malekzadeh, and S. Sanei, “A New Neural Network Approach for Face Recognition Based on Conjugate Gradient Algorithms and Principal Component Analysis,” Journal of Mathematics and Computer Science, vol. 6, no. 3, pp. 166–175, 2013, doi: 10.22436/jmcs.06.03.01.</ref>===		===Facial recognition===
	[[File:WX20211211-012754@2x.png|thumb|Figure 3. The process of decomposing an image. DWT plays a significant role in reducing the dimension of an image and extracting the features by decomposing an image in the frequency domain into sub-bands at different scales. The DWT of an image is created as follows: in the first level of decomposition, the image is split into four sub-bands, namely HH1, HL1, LH1, and LL1, as shown in the figure. The HH1, HL1 and LH1 sub-bands represent the diagonal details, horizontal features and vertical structures of the image, respectively. The LL1 sub-band is the low resolution residual consisting of low frequency components and it is this sub-band which is further split at higher levels of decomposition.<ref name = "FR" /><ref>H. Azami, M. R. Mosavi, S. Sanei, Classification of GPS satellites using improved back propagation training algorithms, Wireless Personal Communications, Springer-Verlag, DOI 10.1007/s11277-012-0844-7 (2012)</ref>]]		[[File:WX20211211-012754@2x.png|thumb|Figure 3. The process of decomposing an image. DWT plays a significant role in reducing the dimension of an image and extracting the features by decomposing an image in the frequency domain into sub-bands at different scales. The DWT of an image is created as follows: in the first level of decomposition, the image is split into four sub-bands, namely HH1, HL1, LH1, and LL1, as shown in the figure. The HH1, HL1 and LH1 sub-bands represent the diagonal details, horizontal features and vertical structures of the image, respectively. The LL1 sub-band is the low resolution residual consisting of low frequency components and it is this sub-band which is further split at higher levels of decomposition.<ref name = "FR">H. Azami, M. Malekzadeh, and S. Sanei, “A New Neural Network Approach for Face Recognition Based on Conjugate Gradient Algorithms and Principal Component Analysis,” Journal of Mathematics and Computer Science, vol. 6, no. 3, pp. 166–175, 2013, doi: 10.22436/jmcs.06.03.01.</ref><ref>H. Azami, M. R. Mosavi, S. Sanei, Classification of GPS satellites using improved back propagation training algorithms, Wireless Personal Communications, Springer-Verlag, DOI 10.1007/s11277-012-0844-7 (2012)</ref>]]

	Face recognition can be realized by combining conjugate gradient algorithms with other methods. The basic steps are to decompose images into a set of time-frequency coefficients using the discrete wavelet transform (DWT) (Figure 3)<ref name = "FR" />, and then to use basic back propagation (BP) to train a neural network (NN). To overcome the slow convergence of BP with steepest gradient descent, conjugate gradient methods are introduced. Generally, there are four types of CG methods for training a feed-forward NN, namely, Fletcher-Reeves CG, Polak-Ribière CG, Powell-Beale CG, and scaled CG. All the CG methods include the steps demonstrated in ''Alg 3'' with their respective modifications<ref name = "FR" /><ref name = "4CGs">M. H. Shaheed, Performance analysis of 4 types of conjugate gradient algorithms in the nonlinear dynamic modelling of a TRMS using feedforward neural networks, IEEE Conference on Systems, Man and Cybernetics, (2004), 5985-5990</ref>.<br><br>		Face recognition can be realized by combining conjugate gradient algorithms with other methods. The basic steps are to decompose images into a set of time-frequency coefficients using the discrete wavelet transform (DWT) (Figure 3)<ref name = "FR" />, and then to use basic back propagation (BP) to train a neural network (NN). To overcome the slow convergence of BP with steepest gradient descent, conjugate gradient methods are introduced. Generally, there are four types of CG methods for training a feed-forward NN, namely, Fletcher-Reeves CG, Polak-Ribière CG, Powell-Beale CG, and scaled CG. All the CG methods include the steps demonstrated in ''Alg 3'' with their respective modifications<ref name = "FR" /><ref name = "4CGs">M. H. Shaheed, Performance analysis of 4 types of conjugate gradient algorithms in the nonlinear dynamic modelling of a TRMS using feedforward neural networks, IEEE Conference on Systems, Man and Cybernetics, (2004), 5985-5990</ref>.<br><br>

MiScott1601 at 08:23, 11 December 2021

2021-12-11T08:23:21Z

← Older revision		Revision as of 04:23, 11 December 2021
Line 1:		Line 1:
	Author: Alexandra Roberts, Anye Shi, Yue Sun (SYSEN 6800 Fall 2021)		Author: Alexandra Roberts, Anye Shi, Yue Sun (CHEME/SYSEN 6800, Fall 2021)

	== Introduction ==		== Introduction ==

MiScott1601: /* Conclusion */

2021-12-11T08:20:08Z

Conclusion

← Older revision		Revision as of 04:20, 11 December 2021
Line 311:		Line 311:
	== Conclusion ==		== Conclusion ==
	The conjugate gradient method was invented to avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent. As an iterative method, each step only requires <math>\textbf{A}\textbf{d}_i</math> multiplication free from the storage of matrix <math>\textbf{A}</math>. And selected direction vectors are treated as a conjugate version of the successive gradients obtained while the method progresses. So it monotonically improves approximations <math>\textbf{x}		The conjugate gradient method was invented to avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent. As an iterative method, each step only requires <math>\textbf{A}\textbf{d}_i</math> multiplication free from the storage of matrix <math>\textbf{A}</math>. And selected direction vectors are treated as a conjugate version of the successive gradients obtained while the method progresses. So it monotonically improves approximations <math>\textbf{x}
	_k</math> to the exact solution and may reach the required tolerance after a relatively small (compared to the problem size) number of iterations in the absence of [https://en.wikipedia.org/wiki/Round-off_error round-off error], which makes it widely used for solving large and sparse problems. Because of the high flexibility of the method framework, variants of CG algorithms have been proposed and can be applied to a variety of applications in different fields,such as machine learning and deep learning, in order to enhance the algorithm performance.		_k</math> to the exact solution and may reach the required tolerance after a relatively small (compared to the problem size) number of iterations in the absence of [https://en.wikipedia.org/wiki/Round-off_error round-off error], which makes it widely used for solving large and sparse problems. Because of the high flexibility of the method framework, variants of CG algorithms have been proposed and can be applied to a variety of applications in different fields, such as machine learning and deep learning, in order to enhance the algorithm performance.
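The point that each step only needs the product <math>\textbf{A}\textbf{d}_i</math>, not the stored matrix, can be sketched with a matrix-free variant of CG. In this hedged illustration the operator <code>apply_A</code> is a hypothetical tridiagonal stencil (-1, 2, -1), chosen only because it is symmetric positive definite and easy to apply without forming <math>\textbf{A}</math>; the function names are assumptions of this sketch.

```python
def apply_A(v):
    """Hypothetical operator: tridiagonal stencil (-1, 2, -1), never stored as a matrix."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_matrix_free(apply_op, b, tol=1e-10, max_iter=50):
    """CG that touches the matrix only through the callable apply_op."""
    x = [0.0] * len(b)
    r = list(b)                 # residual for the zero initial guess
    d = list(r)
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 <= tol:
            break
        Ad = apply_op(d)        # the only place the operator is used
        alpha = rs_old / dot(d, Ad)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        d = [ri + (rs_new / rs_old) * di for ri, di in zip(r, d)]
        rs_old = rs_new
    return x


# Right-hand side chosen so that the exact solution is (1, 2, 3, 4, 5).
b = [0.0, 0.0, 0.0, 0.0, 6.0]
x_mf = cg_matrix_free(apply_A, b)
print(x_mf)
```

Memory use here grows only with the length of the vectors, which is the property that makes the approach attractive for large, sparse problems.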

	== Reference ==		== Reference ==

MiScott1601: /* Conclusion */

2021-12-11T08:19:51Z

Conclusion

← Older revision		Revision as of 04:19, 11 December 2021
Line 311:		Line 311:
	== Conclusion ==		== Conclusion ==
	The conjugate gradient method was invented to avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent. As an iterative method, each step only requires <math>\textbf{A}\textbf{d}_i</math> multiplication free from the storage of matrix <math>\textbf{A}</math>. And selected direction vectors are treated as a conjugate version of the successive gradients obtained while the method progresses. So it monotonically improves approximations <math>\textbf{x}		The conjugate gradient method was invented to avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent. As an iterative method, each step only requires <math>\textbf{A}\textbf{d}_i</math> multiplication free from the storage of matrix <math>\textbf{A}</math>. And selected direction vectors are treated as a conjugate version of the successive gradients obtained while the method progresses. So it monotonically improves approximations <math>\textbf{x}
	_k</math> to the exact solution and may reach the required tolerance after a relatively small (compared to the problem size) number of iterations in the absence of [https://en.wikipedia.org/wiki/Round-off_error round-off error], which makes it widely used for solving large and sparse problems. Because of the high flexibility of the method framework, variants of CG algorithms have been proposed and can be applied to a variety of applications in different fields in order to enhance the algorithm performance.		_k</math> to the exact solution and may reach the required tolerance after a relatively small (compared to the problem size) number of iterations in the absence of [https://en.wikipedia.org/wiki/Round-off_error round-off error], which makes it widely used for solving large and sparse problems. Because of the high flexibility of the method framework, variants of CG algorithms have been proposed and can be applied to a variety of applications in different fields,such as machine learning and deep learning, in order to enhance the algorithm performance.

	== Reference ==		== Reference ==

MiScott1601: /* Conclusion */

2021-12-11T08:18:46Z

Conclusion

← Older revision		Revision as of 04:18, 11 December 2021
Line 311:		Line 311:
	== Conclusion ==		== Conclusion ==
	The conjugate gradient method was invented to avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent. As an iterative method, each step only requires <math>\textbf{A}\textbf{d}_i</math> multiplication free from the storage of matrix <math>\textbf{A}</math>. And selected direction vectors are treated as a conjugate version of the successive gradients obtained while the method progresses. So it monotonically improves approximations <math>\textbf{x}		The conjugate gradient method was invented to avoid the high computational cost of Newton’s method and to accelerate the convergence rate of steepest descent. As an iterative method, each step only requires <math>\textbf{A}\textbf{d}_i</math> multiplication free from the storage of matrix <math>\textbf{A}</math>. And selected direction vectors are treated as a conjugate version of the successive gradients obtained while the method progresses. So it monotonically improves approximations <math>\textbf{x}
	_k</math> to the exact solution and may reach the required tolerance after a relatively small (compared to the problem size) number of iterations in the absence of [https://en.wikipedia.org/wiki/Round-off_error round-off error], which makes it widely used for solving large and sparse problems. Because of the high flexibility of the method, variants of CG algorithms have been proposed and ~~have been~~ applied to a variety of applications in different fields in order to enhance the algorithm performance.		_k</math> to the exact solution and may reach the required tolerance after a relatively small (compared to the problem size) number of iterations in the absence of [https://en.wikipedia.org/wiki/Round-off_error round-off error], which makes it widely used for solving large and sparse problems. Because of the high flexibility of the method framework, variants of CG algorithms have been proposed and can be applied to a variety of applications in different fields in order to enhance the algorithm performance.

	== Reference ==		== Reference ==