Above we described methodology for solving the “standard” Bayesian optimization problem described in Section 1. This problem assumed a feasible set in which membership is easy to evaluate, such as a hyperrectangle or simplex; a lack of derivative information; and noise-free evaluations.
While there are quite a few applied problems that meet all of the assumptions of the standard problem, there are even more where one or more of these assumptions are broken. We call these “exotic” problems. Here, we describe some prominent examples and give references for more detailed reading. (Although we discuss noisy evaluations in this section on exotic problems, they are substantially less exotic than the others considered, and are often considered to be part of the standard problem.)
Noisy Evaluations. GP regression can be extended naturally to observations with independent normally distributed noise of known variance (Rasmussen and Williams, 2006). This adds a diagonal term, with entries equal to the variance of the noise, to the covariance matrices in (3). In practice, this variance is not known, and so the most common approach is to assume that the noise has a common variance and to include this variance as a hyperparameter. It is also possible to perform inference assuming that the variance varies over the domain, by modeling the log of the variance with a second Gaussian process (Kersting et al., 2007).
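As an illustrative sketch (not taken from the text), the following Python snippet shows where the noise variance enters the GP posterior: the covariance matrix over the observed points acquires a diagonal term equal to the noise variance, while the rest of the posterior-mean and posterior-variance formulas from (3) are unchanged. The kernel choice and the function names are assumptions made for illustration; in practice the noise variance would be estimated jointly with the kernel hyperparameters, e.g. by maximizing the marginal likelihood.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel; any positive-definite kernel could be used here."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2):
    """Posterior mean and variance of f at X_star given noisy observations y = f(X) + eps.

    The only change relative to the noise-free case is the `noise_var * I` term
    added to the covariance matrix of the observations.
    """
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))   # noisy observation covariance
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    mean = K_s.T @ np.linalg.solve(K, y)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)
```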
The KG, ES, and PES acquisition functions apply directly in the setting with noise, and they retain their one-step optimality properties. One simply uses the posterior mean of the Gaussian process that includes noise.
Direct use of the EI acquisition function presents conceptual challenges, however, since the "improvement" that results from a function value is no longer easily defined, and $f(x)$ in (7) is no longer observed. Authors have employed a variety of heuristic approaches, substituting different normal distributions for the distribution of $f(x)$ in (7), and typically using the maximum of the posterior mean at the previously evaluated points in place of $f_n^*$. Popular substitutes for the distribution of $f(x)$ include the distribution of $\mu_{n+1}(x)$, the distribution of $y_{n+1}$, and continuing to use the distribution of $f(x)$ even though it is not observed. Because of these approximations, KG can outperform EI substantially in problems with substantial noise (Wu and Frazier, 2016; Frazier et al., 2009).
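As a hedged illustration of one such heuristic (not a prescription from the text), the sketch below scores a candidate point with the usual closed-form EI expression, but replaces $f_n^*$ with the maximum posterior mean over the previously evaluated points and uses the latent GP posterior at the candidate as the distribution of $f(x)$. The helper `gp_posterior` is the hypothetical function from the previous snippet.

```python
import numpy as np
from scipy.stats import norm

def noisy_ei(x_cand, X, y, noise_var=1e-2):
    """Heuristic expected improvement under noise (one of several possible variants).

    Incumbent: max posterior mean over previously evaluated points (replaces f_n^*).
    Distribution of f(x): latent GP posterior at the candidate points.
    """
    mu_obs, _ = gp_posterior(X, y, X, noise_var)        # posterior mean at x_1:n
    incumbent = mu_obs.max()                            # stands in for f_n^*
    mu, var = gp_posterior(X, y, x_cand, noise_var)     # posterior on f at candidates
    sigma = np.sqrt(np.maximum(var, 1e-12))
    z = (mu - incumbent) / sigma
    return (mu - incumbent) * norm.cdf(z) + sigma * norm.pdf(z)
```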
As an alternative approach to applying EI when measurements are noisy, Scott et al. (2011) considers noisy evaluations under the restriction made in the derivation of EI: that the reported solution must be a previously evaluated point. It then finds the one-step optimal place to sample under this assumption. Its analysis is similar to that used to derive the KG policy, except that $x^*$ is restricted to those points that have been evaluated.
Indeed, if we were to report a final solution after $n$ measurements, it would be the point among $x_{1:n}$ with the largest value of $\mu_n(x)$, and it would have conditional expected value $\mu_n^{**} = \max_{i=1,\ldots,n} \mu_n(x_i)$. If we were to take one more sample at $x_{n+1} = x$, it would have conditional expected value under the new posterior of $\mu_{n+1}^{**} = \max_{i=1,\ldots,n+1} \mu_{n+1}(x_i)$. Taking the expected value of the difference, the value of sampling at $x$ is
$$\mathbb{E}_n\left[\mu_{n+1}^{**} - \mu_n^{**} \mid x_{n+1} = x\right]. \qquad (13)$$
Unlike the case with noise-free evaluations, this sample may cause $\mu_{n+1}(x_i)$ to differ from $\mu_n(x_i)$ for $i \le n$, necessitating a more complex calculation than in the noise-free setting (but a simpler calculation than for the KG policy). A procedure for calculating this quantity and its derivative is given in Scott et al. (2011). While we can view this acquisition function as an approximation to the KG acquisition function, as Scott et al. (2011) does (they call it the KGCP acquisition function), we argue here that it is the most natural generalization of EI's assumptions to the case with noisy measurements.
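To make (13) concrete, the sketch below estimates it by Monte Carlo for a single candidate point: it samples a hypothetical noisy observation $y_{n+1}$ from the posterior predictive at $x$, recomputes the posterior mean at all evaluated points (now including $x$), and averages the improvement of $\mu_{n+1}^{**}$ over $\mu_n^{**}$. This is an illustrative approximation built on the assumed `gp_posterior` helper above, not the analytical procedure of Scott et al. (2011), which computes the expectation and its gradient exactly.

```python
import numpy as np

def kgcp_mc(x_cand, X, y, noise_var=1e-2, n_samples=64, seed=None):
    """Monte Carlo estimate of (13) for one candidate point x_cand with shape (1, d)."""
    rng = np.random.default_rng(seed)
    mu_n, _ = gp_posterior(X, y, X, noise_var)
    mu_star_n = mu_n.max()                                   # mu_n^{**}
    mu_c, var_c = gp_posterior(X, y, x_cand, noise_var)      # latent posterior at x_cand
    X_aug = np.vstack([X, x_cand])
    improvements = []
    for _ in range(n_samples):
        # Sample a hypothetical noisy observation y_{n+1} at the candidate point.
        y_new = rng.normal(mu_c[0], np.sqrt(var_c[0] + noise_var))
        y_aug = np.append(y, y_new)
        # Updated posterior mean at all evaluated points, including x_cand.
        mu_np1, _ = gp_posterior(X_aug, y_aug, X_aug, noise_var)
        improvements.append(mu_np1.max() - mu_star_n)        # mu_{n+1}^{**} - mu_n^{**}
    return float(np.mean(improvements))
```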