Having surveyed Gaussian processes, we return to Algorithm 1 and discuss the acquisition function used in that loop. We focus on the setting described in Section 1 with noise-free evaluations, which we call the “standard” problem, and then discuss noisy evaluations, parallel evaluations, derivative observations, and other “exotic” extensions in Section 5.
The most commonly used acquisition function is expected improvement, and we discuss it first, in Section 4.1. Expected improvement performs well and is easy to use. We then discuss the knowledge gradient (Section 4.2) and the entropy search and predictive entropy search (Section 4.3) acquisition functions. These alternative acquisition functions are most useful in exotic problems, where an assumption made by expected improvement, that the primary benefit of sampling occurs through an improvement at the point sampled, is no longer true.
4.1 Expected Improvement
The expected improvement acquisition function is derived via a thought experiment. Suppose we are using Algorithm 1 to solve (1), in which $x_n$ indicates the point sampled at iteration $n$ and $y_n$ indicates the observed value. Assume that we may only return a point that we have evaluated as our final solution to (1). Also suppose for the moment that we have no evaluations left to make, and must return a solution based on those we have already performed. Since we observe $f$ without noise, the optimal choice is the previously evaluated point with the largest observed value. Let $f_n^* = \max_{m \le n} f(x_m)$ be the value of this point, where $n$ is the number of times we have evaluated $f$ thus far.
Now suppose we in fact have one additional evaluation to perform, and we can perform it anywhere. If we evaluate at $x$, we will observe $f(x)$. After this new evaluation, the value of the best point we have observed will be either $f(x)$ (if $f(x) \ge f_n^*$) or $f_n^*$ (if $f(x) \le f_n^*$). The improvement in the value of the best observed point is then $f(x) - f_n^*$ if this quantity is positive, and $0$ otherwise. We can write this improvement more compactly as $[f(x) - f_n^*]^+$, where $a^+ = \max(a, 0)$ denotes the positive part.
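As a concrete illustration, the incumbent value $f_n^*$ and the realized improvement can be computed directly; the observed values and candidate evaluations below are hypothetical, chosen only for the sketch:

```python
import numpy as np

# Hypothetical noise-free observations y_m = f(x_m) from n = 3 evaluations
y = np.array([1.2, 3.4, 2.1])
f_star = y.max()  # f_n^* = max_{m <= n} f(x_m)

# Realized improvement [f(x) - f_n^*]^+ for some hypothetical new values f(x)
f_x = np.array([4.0, 3.4, 2.0])
improvement = np.maximum(f_x - f_star, 0.0)  # positive part: max(f(x) - f_n^*, 0)
```

Only the first candidate exceeds the incumbent, so only it yields a positive improvement; the other two contribute zero rather than a penalty.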
While we would like to choose $x$ so that this improvement is large, $f(x)$ is unknown until after the evaluation. What we can do, however, is take the expected value of this improvement and choose $x$ to maximize it. We define the expected improvement as
$$\mathrm{EI}_n(x) := E_n\bigl[\,[f(x) - f_n^*]^+\,\bigr] \qquad (7)$$
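Before turning to the closed form, the expectation in (7) can be approximated directly by Monte Carlo; a minimal sketch, assuming the posterior on $f(x)$ is normal (as it is under a Gaussian process) and using hypothetical values for the posterior mean and standard deviation:

```python
import numpy as np

def expected_improvement_mc(mu_n, sigma_n, f_star, n_samples=100_000, seed=0):
    """Monte Carlo estimate of EI_n(x) = E_n[[f(x) - f_n^*]^+].

    Assumes the posterior on f(x) is Normal(mu_n, sigma_n^2).
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu_n, sigma_n, size=n_samples)  # draws of f(x) under the posterior
    return np.maximum(samples - f_star, 0.0).mean()      # average positive improvement

# Hypothetical posterior at a candidate point x
ei = expected_improvement_mc(mu_n=0.0, sigma_n=1.0, f_star=0.0)
```

With $\mu_n(x) = f_n^*$ and $\sigma_n(x) = 1$, the estimate should be close to $1/\sqrt{2\pi} \approx 0.399$; in practice one uses the closed-form expression discussed next rather than sampling.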
Here, $E_n[\cdot] = E[\cdot \mid x_{1:n}, y_{1:n}]$ indicates the expectation taken under the posterior distribution given evaluations of $f$ at $x_1, \ldots, x_n$. This posterior distribution is given by (3): $f(x)$ given $x_{1:n}, y_{1:n}$ is normally distributed with mean $\mu_n(x)$ and variance $\sigma_n^2(x)$. The expected improvement can be evaluated in closed form using integration by parts, as described
in Jones et al. (1998) or Clark (1961). The resulting expression is