Bayesian Optimization

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Author: Deepa Korani

Steward: Fengqi You

== Introduction ==
In black-box optimization the goal is to solve the problem <math>\min_{x\in\Omega} f(x)</math>, where <math> f </math> is a computationally expensive black-box function and the domain <math>\Omega</math> is commonly a hyper-rectangle. Because evaluations are computationally expensive, the goal is to limit the number of evaluations of <math>f</math> to a few hundred. In the black-box setting, no additional information is known about <math> f </math>: no first- or second-order information is observed when evaluating <math> f </math>. This is commonly referred to as derivative-free optimization.<ref>R. Turner, D. Eriksson, M. McCourt, and I. Guyon, “Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning,” arXiv:2104.10201v2 [cs.LG], 2021.</ref>


Black-box optimization problems of this form appear everywhere. Most machine learning (ML) models have hyperparameters that require tuning via black-box (i.e., derivative-free) optimization.<ref>J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems 25, pp. 2960–2968, 2012.</ref> These black-box optimization problems can be solved using ''Bayesian Optimization (BO)'' methods. BO relies on a surrogate model for the objective function that provides a measure of uncertainty.


Moreover, many non-machine-learning problems also benefit from the use of BO: for example, material discovery and manufacturing design are effective uses of BO.

== Methodology ==
The Bayesian Optimization algorithm has two main components:<ref>“Few-Shot Bayesian Optimization with Deep Kernel Surrogates,” pp. 1–15, 2021.</ref>

====Probabilistic Model====
The probabilistic model is also called the ''surrogate function'' or the ''posterior distribution''.<ref>E. Brochu, V. M. Cora, and N. De Freitas, “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning,” arXiv:1012.2599v1 [cs.LG], 2010.</ref> The posterior captures the updated belief about the unknown objective function. A better interpretation of this is that the objective function is estimated with a surrogate function. The most common surrogate model is the Gaussian Process (GP). Other probabilistic models are Random Forests<ref>F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential Model-Based Optimization for General Algorithm Configuration (extended version).”</ref> and Tree Parzen Estimators.<ref>J. Bergstra et al., “Algorithms for Hyper-Parameter Optimization,” pp. 1–9.</ref>


Here the Gaussian Process is described in detail:


===== Gaussian Process (GP) =====
A Gaussian Process can be described as a collection of random variables, any finite number of which are jointly normally distributed. It is defined by a mean function <math> m[x] </math> and a covariance function <math> k[x,x'] </math> that measures the similarity between two points. The Gaussian Process models the objective function as <math> f[x] \sim GP[m[x],k[x,x']] </math>, which states that the function values at any finite set of points are jointly normally distributed, with means given by <math> m </math> and covariances given by <math> k </math>.
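
To make the definition concrete, the following is a minimal NumPy sketch that draws sample functions from a GP prior at a finite set of points; the zero mean function and squared-exponential kernel are illustrative assumptions, not part of the definition:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(x1, x2, length_scale=0.2):
    """Squared-exponential covariance k[x, x'] measuring similarity between two points."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

# Any finite collection of points has jointly normal function values:
# f ~ Norm[m, K[X, X]], with m = 0 assumed here.
x = np.linspace(0, 1, 100)
K = rbf_kernel(x, x) + 1e-9 * np.eye(len(x))   # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros_like(x), K, size=3)
print(samples.shape)   # (3, 100): three sample functions from the GP prior
</syntaxhighlight>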








Given observations <math> f= [f[x_1],f[x_2],\dotsc,f[x_t]] </math> at <math> t </math> points, the objective is to make a prediction about the function value at a new point <math> x^\star </math>. This new function value <math>f^\star=f[x^\star]</math> is jointly normally distributed with the observations <math> f </math>, so that:


<math>\mathbf{Pr}\begin{pmatrix}\begin{bmatrix} f \\ f^\star \end{bmatrix}\end{pmatrix} = Norm\begin{bmatrix}\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K[X,X] & K[X,x^\star] \\ K[x^\star,X] & K[x^\star,x^\star] \end{bmatrix}\end{bmatrix}</math>
where <math> K[X,X] </math> is a <math> t\times t </math> matrix whose element <math> (i,j) </math> is given by <math> k[x_i,x_j] </math>, <math> K[X,x^\star] </math> is a <math> t\times 1 </math> vector whose element <math> i </math> is given by <math> k[x_i,x^\star] </math>, and so on.


Since the function values in the equation above are jointly normal, the conditional distribution <math> Pr(f^\star \vert f) </math> must also be normal, with mean and variance given by the standard conditioning formulas:


<math> Pr(f^\star \vert f)=Norm[\mu [x^\star], \sigma^2 [x^\star]] </math>

where:

<math> \mu[x^\star] = K[x^\star,X][K[X,X]]^{-1}f </math>

<math> \sigma^2[x^\star] = K[x^\star,x^\star] - K[x^\star,X][K[X,X]]^{-1}K[X,x^\star] </math>


Using these formulas, the distribution of the function value at any new point <math> x^\star </math> can be estimated. The best estimate of the function value is given by the mean <math> \mu[x^\star] </math>, and the uncertainty is given by the variance <math>\sigma^2[x^\star] </math>.
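
These posterior formulas translate directly into code. Below is a minimal sketch, assuming a zero prior mean and a squared-exponential kernel; the observed values are placeholders:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(x1, x2, length_scale=0.2):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

X = np.array([0.1, 0.4, 0.9])        # t observed locations
f = np.sin(2 * np.pi * X)            # observed values f = [f[x_1], ..., f[x_t]] (placeholders)
x_star = np.array([0.55])            # new point at which to predict

K_XX = rbf_kernel(X, X) + 1e-9 * np.eye(len(X))   # K[X, X] (jitter for stability)
K_sX = rbf_kernel(x_star, X)                      # K[x*, X]
K_ss = rbf_kernel(x_star, x_star)                 # K[x*, x*]

K_inv = np.linalg.inv(K_XX)
mu = K_sX @ K_inv @ f                             # mu[x*]      = K[x*,X] K[X,X]^-1 f
var = K_ss - K_sX @ K_inv @ K_sX.T                # sigma^2[x*] = K[x*,x*] - K[x*,X] K[X,X]^-1 K[X,x*]
print(mu.item(), var.item())
</syntaxhighlight>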


====Acquisition Function====
The second component of Bayesian Optimization is the acquisition function. The role of the acquisition function is to guide the search for the optimum.<ref>E. Brochu, V. M. Cora, and N. De Freitas, “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning,” arXiv:1012.2599v1 [cs.LG], 2010.</ref> Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because the prediction is high, the uncertainty is great, or both. The acquisition function is maximized to select the next point at which to evaluate the objective. Hence the main goal is to sample <math> f </math> at <math> \arg\max_{x} u(x \vert D) </math>, where <math> u </math> is the generic symbol for an acquisition function and <math> D </math> is the data observed so far.
 
A good acquisition function should trade off exploration and exploitation. Two typically used acquisition functions are the ''Probability of Improvement (PI)'' and the ''Expected Improvement (EI)''. In PI, the probability that sampling a given data point <math> x </math> improves over the current best observation is maximized. One problem with the approach taken in PI is that it will, by its nature, prefer a point with a small but certain improvement over one which offers a far greater improvement, but at a slightly higher risk. In order to combat this effect, EI proposes maximizing the expected improvement over the current best known point.<ref>D. Jasrasaria and E. O. Pyzer-Knapp, “Dynamic Control of Explore/Exploit Trade-Off in Bayesian Optimization,” 2018.</ref> EI has been shown to have strong theoretical guarantees and empirical effectiveness<ref>J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimization of Machine Learning Algorithms,” pp. 1–12.</ref>, and is often used as a baseline acquisition function. The formulations of three common acquisition functions, the Upper Confidence Bound, the Probability of Improvement, and the Expected Improvement, are defined below:


=====Upper Confidence Bound=====
This acquisition function is defined as:

<math> UCB[x^\star] = \mu[x^\star] + \kappa\sigma[x^\star] </math>

where <math> \kappa \geq 0 </math> controls how much weight is placed on the posterior uncertainty.


=====Probability of Improvement=====
This acquisition function computes the likelihood that the function at <math> x^\star </math> will return a result higher than the current maximum <math> f[\hat{x}] </math>. For each point <math> x^\star </math>, the part of the associated normal distribution that lies above the current maximum is integrated, so that:


<math> PI[x^\star] = \int_{f[\hat{x}]}^{\infty} Norm_{f[x^\star]}[\mu[x^\star],\sigma^2[x^\star]] \, df[x^\star] </math>


=====Expected Improvement=====
The Expected Improvement can be modelled as:


<math> EI[x^\star] = \int_{f[\hat{x}]}^{\infty} (f[x^\star]-f[\hat{x}]) \, Norm_{f[x^\star]}[\mu[x^\star],\sigma^2[x^\star]] \, df[x^\star] </math>
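
For a Gaussian posterior these integrals have standard closed forms. The sketch below expresses UCB, PI, and EI in terms of the posterior mean <math> \mu </math> and standard deviation <math> \sigma </math>; the variable <code>f_best</code> plays the role of <math> f[\hat{x}] </math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, kappa=2.0):
    """Upper Confidence Bound: reward high mean and/or high uncertainty."""
    return mu + kappa * sigma

def probability_of_improvement(mu, sigma, f_best):
    """PI: posterior probability mass above the current maximum f_best."""
    z = (mu - f_best) / sigma
    return norm.cdf(z)

def expected_improvement(mu, sigma, f_best):
    """EI: closed form of the integral above for a Gaussian posterior."""
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: posterior belief mu = 1.0, sigma = 0.5 at some x*, current best 1.2
print(ucb(1.0, 0.5), probability_of_improvement(1.0, 0.5, 1.2), expected_improvement(1.0, 0.5, 1.2))
</syntaxhighlight>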


===Algorithm===
Here follows mathematical-style pseudocode for Bayesian Optimization:<ref>E. Brochu, V. M. Cora, and N. De Freitas, “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning,” arXiv:1012.2599v1 [cs.LG], 2010.</ref>

 1: Algorithm: Bayesian Optimization
 2:    for t = 1, 2, ... do
 3:        Find x_t by optimizing the acquisition function over the GP: x_t = argmax_x u(x | D_{1:t-1})
 4:        Sample the objective function: y_t = f(x_t) + ε_t
 5:        Augment the data D_{1:t} = {D_{1:t-1}, (x_t, y_t)} and update the GP
 6:    end for
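
A line-by-line Python rendering of this pseudocode is sketched below, using scikit-learn's Gaussian Process as the surrogate and Expected Improvement maximized over a dense grid; the objective, noise level, and kernel length scale are placeholder assumptions:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: -(x - 0.3) ** 2                 # placeholder objective (to be maximized)
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 500).reshape(-1, 1)  # candidate points for maximizing u(x|D)

X = rng.uniform(0, 1, (1, 1))                 # initial data D
y = f(X).ravel() + 1e-3 * rng.standard_normal(1)

for t in range(15):
    gp = GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-6).fit(X, y)   # update the GP
    mu, sigma = gp.predict(grid, return_std=True)
    z = (mu - y.max()) / (sigma + 1e-12)
    u = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)   # EI plays the role of u(x | D_{1:t-1})
    x_t = grid[np.argmax(u)].reshape(1, 1)                   # x_t = argmax_x u(x|D)
    y_t = f(x_t).ravel() + 1e-3 * rng.standard_normal(1)     # y_t = f(x_t) + eps_t
    X, y = np.vstack([X, x_t]), np.concatenate([y, y_t])     # augment D and repeat
print(X[np.argmax(y)])   # best observed location, ideally near 0.3
</syntaxhighlight>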
==Applications==
Bayesian Optimization has attracted wide interest in data science, chemical engineering, and materials science. Applications include robotics, environmental monitoring, combinatorial optimization, adaptive Monte Carlo, and reinforcement learning.<ref>K. Swersky et al., “Taking the Human Out of the Loop: A Review of Bayesian Optimization,” 2016, doi: 10.1109/JPROC.2015.2494218.</ref> Some of these applications are described below in detail:

*'''A/B Testing''' - Though the idea of A/B testing dates back to the early days of advertising in the form of so-called focus groups, the advent of the internet and smartphones has given web and app developers a new forum for implementing these tests at unprecedented scales. By redirecting small fractions of user traffic to experimental designs of an ad, app, game, or website, developers can utilize noisy feedback to optimize any observable metric with respect to the product's configuration. In fact, depending on the particular phase of a product's life, new subscriptions may be more valuable than revenue or user retention, or vice versa; the click-through rate might be the relevant objective to optimize for an ad, whereas for a game it may be some measure of user engagement. The crucial problem is how to optimally query these subsets of users in order to find the best product with high probability within a predetermined query budget, or how to redirect traffic sequentially in order to optimize a cumulative metric while incurring the least opportunity cost.
*'''Recommender Systems''' - Recommender systems are used by online content providers to recommend products to their subscribers. Recommending the right product to subscribers is important in order to improve revenue for e-commerce sites and consumption for content sites.
*'''Natural Language Processing and Text''' - Bayesian Optimization has been applied to improve text extraction.<ref>Z. Wang and L. Jin, “Bayesian Multi-Scale Optimistic Optimization,” vol. 33, 2014.</ref>
*'''Environmental Monitoring and Sensor Networks''' - Sensor networks allow monitoring of important environmental conditions, such as temperature and the concentration of pollutants in the atmosphere, soil, and ocean. These networks make noisy measurements that are interpolated to produce a global model of interest. In some cases these sensors are expensive to activate, but one can answer important questions, such as what the hottest or coldest spot in a building is, by activating a relatively small number of sensors. Bayesian Optimization was used for this task, and for the similar one of finding the location of greatest highway traffic congestion.<ref>N. Srinivas, A. Krause, and M. Seeger, “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design,” 2010.</ref>
*'''Materials''' - Bayesian Optimization is applied to the design of experiments, using previous experiments to suggest features of new materials whose performance is evaluated by simulation. The final goal is to improve the design while reducing the number of experiments. Bayesian Optimization is used to predict compounds with low thermal conductivity<ref>A. Seko, H. Hayashi, K. Tsuda, L. Chaput, and I. Tanaka, “Prediction of Low-Thermal-Conductivity Compounds with First-Principles Anharmonic Lattice-Dynamics Calculations and Bayesian Optimization,” doi: 10.1103/PhysRevLett.115.205901.</ref>, optimal melting temperatures<ref>A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka, “to melting temperatures of single and binary component solids,” pp. 1–10, 2018.</ref>, and elastic properties.<ref>D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, and T. Lookman, “properties by adaptive design,” pp. 1–9, 2016, doi: 10.1038/ncomms11241.</ref>


== Numerical Example ==
Here, using the HyperOpt package<ref>J. Bergstra, D. Yamins, and D. D. Cox, “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures,” in Proc. of the 30th International Conference on Machine Learning (ICML 2013), 2013.</ref> as a reference, a simple optimization problem will be solved. The example is broken down into four steps:
===Problem Set Up ===
The minimization problem can be defined as:

<math> x^\star = \arg \min_x f(x) </math>

where <math> f </math> is a fixed objective function that can be evaluated pointwise. The first assumption is that there is no access to the gradient of <math> f </math>. A second assumption is that evaluations of <math> f </math> are noisy.

The problem is solved using Bayesian Optimization: a prior is defined, and at each time step the posterior is used to derive an acquisition function. In order to solve the problem using BO, a sequence of points <math> \{x_n\} </math> is constructed that converges to <math> x^\star </math>.
===Objective Function ===
The objective function for this example is a simple function defined below. It has both a local and a global minimum, as seen in Figure 1. The global minimum is at <math> x^\star = 0.75725 </math>:

<math> f(x) = (6x - 2)^2 \sin(12x-4), \quad x \in [0,1] </math>
[[File:B6B598F0-7339-48F5-8799-C2384A243D24.jpg|thumb|'''Figure 1''': Objective function <math> f(x) = (6x - 2)^2 \sin(12x-4) </math>]]
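A short sketch of this objective in Python follows; the noise scale used for observations is an assumption, since the text only states that evaluations are noisy:

<syntaxhighlight lang="python">
import numpy as np

def f_true(x):
    """Noise-free objective: f(x) = (6x - 2)^2 * sin(12x - 4) on [0, 1]."""
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def observe(x, noise_std=0.05, rng=np.random.default_rng(0)):
    """Noisy evaluation, matching the assumption that observations of f are noisy."""
    return f_true(x) + noise_std * rng.standard_normal(np.shape(x))

x = np.linspace(0, 1, 1001)
print(x[np.argmin(f_true(x))])   # ~0.757, the global minimum quoted above
</syntaxhighlight>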
===Setting a Gaussian Process Prior===
As described earlier, the Gaussian Process is most commonly used as the prior for Bayesian Optimization. In this example, given inputs <math> X </math> and the corresponding noisy observations <math> y </math>, the model takes the form:

<math> f \sim MultivariateNormal(0,k(X,X)) </math>

<math> y = f + \epsilon </math>

Here <math> \epsilon </math> is i.i.d. Gaussian noise and <math> k(X,X) </math> is a covariance matrix whose entries are given by <math> k(x, x^\prime) </math> for each pair of inputs <math> (x,x^\prime) </math>. The domain is the set of values of <math> x </math> over which the function is evaluated; a uniform distribution over this domain is used to select evaluation points.
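A sketch of this model using scikit-learn is shown below (an assumption, since the original example does not fix a library): the i.i.d. noise <math> \epsilon </math> enters through the <code>alpha</code> parameter, and the covariance <math> k </math> is assumed to be a squared-exponential kernel.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (5, 1))                        # inputs drawn uniformly over the domain
f = ((6 * X - 2) ** 2 * np.sin(12 * X - 4)).ravel()  # latent function values f
y = f + 0.05 * rng.standard_normal(5)                # y = f + eps, eps i.i.d. Gaussian noise

# f ~ MultivariateNormal(0, k(X, X)); alpha places the noise variance on the diagonal
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=0.05 ** 2).fit(X, y)
mu, sigma = gp.predict(np.array([[0.5]]), return_std=True)
print(mu.item(), sigma.item())   # posterior mean and standard deviation at x = 0.5
</syntaxhighlight>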
===Define Acquisition Function===
For this problem, the acquisition function used is the Lower Confidence Bound acquisition function, due to its simplicity and easy interpretability.


The Lower Confidence Bound acquisition function is defined as:

<math> \alpha(x) = \mu(x) - \kappa\sigma(x) </math>

where <math> \mu(x) </math> and <math> \sigma(x) </math> are the mean and the standard deviation (square root of the variance) of the posterior at the point <math> x </math>, and <math> \kappa \geq 0 </math> trades off exploitation against exploration.


===Results and Running the Optimization===
Once the acquisition function and the prior are defined, the acquisition function is used to derive the next query point as follows:
<math> x_{n+1} = \arg \min_x \alpha(x) </math>

Then <math> f(x_{n+1})</math> is evaluated and the posterior is updated; this repeats until convergence.
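
Putting the pieces together, the following is a minimal end-to-end sketch of this loop on the example objective with the Lower Confidence Bound acquisition function; the noise scale, kernel length scale, and <math> \kappa = 2 </math> are assumptions:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: (6 * x - 2) ** 2 * np.sin(12 * x - 4)
observe = lambda x: f(x) + 0.05 * rng.standard_normal(np.shape(x))
grid = np.linspace(0, 1, 500).reshape(-1, 1)

X = rng.uniform(0, 1, (2, 1))            # two initial design points
y = observe(X).ravel()

for n in range(8):                       # budget of 8 BO evaluations, as in the text
    gp = GaussianProcessRegressor(kernel=RBF(0.2), alpha=0.05 ** 2).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    acq = mu - 2.0 * sigma               # alpha(x) = mu(x) - kappa * sigma(x), kappa = 2
    x_next = grid[np.argmin(acq)].reshape(1, 1)      # x_{n+1} = argmin_x alpha(x)
    X = np.vstack([X, x_next])
    y = np.append(y, observe(x_next).ravel())        # evaluate f(x_{n+1}), update the posterior

print(X[np.argmin(y)].item())   # best point found; close to x* = 0.75725
</syntaxhighlight>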


Results for the first eight iterations of the Gaussian Process regression and acquisition function are shown in Figure 2 and Figure 3. Since one of the assumptions stated earlier is that the observations contain noise, it is impossible to converge to the ground-truth solution. However, as seen in Figures 2 and 3, with a limited budget of 8 evaluations BO has converged close to the global minimum at <math> x^\star = 0.75725 </math>.
[[File:Screen Shot 2021-12-19 at 6.15.29 PM.png|thumb|'''Figure 2''': GPR and acquisition function plots for the first four time steps]]
[[File:Screen Shot 2021-12-19 at 6.15.44 PM.png|thumb|'''Figure 3''': GPR and acquisition function plots for the next four iterations]]


==Bayesian Optimization Packages==
Here is a list of packages in Python, Java, and C++ that utilize Bayesian Optimization.<ref>K. Swersky et al., “Taking the Human Out of the Loop: A Review of Bayesian Optimization,” 2016, doi: 10.1109/JPROC.2015.2494218.</ref>

{| class="wikitable"
|+ Table 1: Packages utilizing Bayesian Optimization
! Package !! License !! URL !! Language !! Model
|-
| SMAC || Academic || http://www.cs.ubc.ca/labs/beta/Projects/SMAC || Java || Random Forest
|-
| Hyperopt || BSD || https://github.com/hyperopt/hyperopt || Python || Tree Parzen Estimator
|-
| Spearmint || Academic || https://github.com/HIPS/Spearmint || Python || Gaussian Process
|-
| BayesOpt || GPL || https://github.com/rmcantin/bayesopt || C++ || Gaussian Process
|-
| PyBO || BSD || https://github.com/mwhoffman/pybo || Python || Gaussian Process
|-
| MOE || Apache 2.0 || https://github.com/Yelp/MOE || Python/C++ || Gaussian Process
|}
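
As a usage illustration for one of these packages, the example objective from above can be minimized with Hyperopt's Tree Parzen Estimator in a few lines (a sketch; the evaluation budget is an arbitrary choice):

<syntaxhighlight lang="python">
import numpy as np
from hyperopt import fmin, tpe, hp

def objective(x):
    """The example objective from the numerical section."""
    return float((6 * x - 2) ** 2 * np.sin(12 * x - 4))

best = fmin(
    fn=objective,
    space=hp.uniform("x", 0, 1),   # uniform search space over the domain [0, 1]
    algo=tpe.suggest,              # Tree Parzen Estimator, as listed in the table
    max_evals=200,
)
print(best)                        # e.g. {'x': 0.757...}, near the known minimum
</syntaxhighlight>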

==Conclusion==
Bayesian Optimization is primarily utilized when black-box functions are expensive to evaluate and noisy, and it can be implemented easily in Python. Bayesian Optimization uses a surrogate function to estimate the objective through sampling. These surrogates, such as the Gaussian Process, are represented as probability distributions which can be updated in light of new information. Acquisition functions such as the Upper Confidence Bound, Probability of Improvement, and Expected Improvement are then used to evaluate the probability that exploring a certain point in space will yield a good return. Lastly, Bayesian Optimization is often extended to complex problems, including but not limited to hyperparameter tuning of machine learning models.<ref>J. Wu, X. C. Hao, Z. L. Xiong, and H. Lei, “Hyperparameter Optimization for Machine Learning Models Based on Bayesian Optimization,” J. Electron. Sci. Technol., vol. 17, no. 1, pp. 26–40, 2019, doi: 10.11989/JEST.1674-862X.80904120.</ref>

==References==
<references />