The knowledge-gradient acquisition function is derived by revisiting the assumption made in EI’s derivation that we are only willing to return a previously evaluated point as our final solution. That assumption is reasonable when evaluations are noise-free and we are highly risk averse, but if the decision-maker is willing to tolerate some risk then she might be willing to report a final solution that has some uncertainty attached to it. Moreover, if evaluations have noise (discussed below in Section 5) then the final solution reported will necessarily have uncertain value because we can hardly evaluate it an infinite number of times.
We replace this assumption by allowing the decision-maker to return any solution she likes, even if it has not been previously evaluated. We also assume risk-neutrality (Berger, 2013), i.e., we value a random outcome X according to its expected value. The solution that we would choose if we were to stop
sampling after n samples would be the one with the largest µn(x) value. This solution (call it x̂∗, since it approximates the global optimum x∗) would have value f(x̂∗). This value is random under the posterior, and has conditional expected value µn(x̂∗) = maxx′ µn(x′) =: µ∗n.
On the other hand, if we were to take one more sample at x, we would obtain a new posterior
distribution with posterior mean µn+1(·). This posterior mean would be computed via (3), but including the additional observation (xn+1, yn+1). If we were to report a final solution after this sample, its expected value under the new posterior distribution would be µ∗n+1 := maxx′ µn+1(x′). Thus, the increase in conditional expected solution value due to sampling is µ∗n+1 − µ∗n.
While this quantity is unknown before we sample at xn+1, we can compute its expected value given the observations at x1, . . . , xn that we have. We call this quantity the knowledge gradient (KG) for measuring at x,
KGn(x) := En[µ∗n+1 − µ∗n | xn+1 = x] . (10)
Using the knowledge gradient as our acquisition function then leads us to sample at the point with largest KGn(x), argmaxx KGn(x).

This algorithm was first proposed in Frazier et al. (2009) for GP regression over discrete A, building
on earlier work (Frazier et al., 2008) that proposed the same algorithm for Bayesian ranking and selection (Chick and Inoue, 2001) with an independent prior. (Bayesian ranking and selection is similar to Bayesian optimization, except that A is discrete and finite, observations are necessarily noisy, and the prior is typically independent across x.)
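In the independent-prior ranking-and-selection setting just mentioned, the expectation in (10) can be evaluated in closed form using properties of the normal distribution: sampling alternative i moves only that alternative’s posterior mean, so µ∗n+1 = max(µn+1(i), c), where c is the best competing posterior mean. The following sketch illustrates that computation; the function name, interface, and two-alternative example are our own illustration, not code from the cited papers.

```python
from math import sqrt
from statistics import NormalDist

def kg_independent(mu, sigma, lam, i):
    # Exact KG for alternative i under an independent normal posterior:
    # mu[j], sigma[j] are the posterior mean and standard deviation of f at
    # alternative j, and lam is the observation-noise variance.
    nd = NormalDist()
    # Standard deviation, seen from time n, of the updated mean mu_{n+1}(i):
    # the reduction in posterior variance from one noisy sample at i.
    s = sigma[i] ** 2 / sqrt(sigma[i] ** 2 + lam)
    c = max(m for j, m in enumerate(mu) if j != i)  # best competing mean
    if s == 0.0:
        return 0.0  # no learning possible, so no expected improvement
    d = (c - mu[i]) / s
    # E[max(mu_i + s Z, c)] = c + s * (phi(d) - d * (1 - Phi(d))), Z ~ N(0,1)
    e_max = c + s * (nd.pdf(d) - d * (1.0 - nd.cdf(d)))
    return e_max - max(mu[i], c)
```

For example, with two alternatives having means 1.0 and 0.5, unit posterior standard deviations, and unit noise variance, the KG of sampling the trailing alternative is a small positive number: the sample can only help the final decision in expectation, consistent with KGn(x) ≥ 0.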
The conceptually simplest way to compute the KG acquisition function is via simulation, as shown in Algorithm 2. Within a loop, this algorithm simulates one possible value for the observation yn+1 that may result from taking evaluation n+1 at a designated x. Then it computes what the maximum of the new posterior mean µ∗n+1 would be if that value for yn+1 were the one that actually resulted from the measurement. It then subtracts µ∗n to obtain the corresponding increase in solution quality. This comprises one loop of the algorithm. It iterates this loop many (J) times and averages the differences µ∗n+1 − µ∗n obtained from different simulated values for yn+1 to estimate the KG acquisition function KGn(x). As J grows large, this estimate converges to KGn(x).

In principle, this algorithm can be used to evaluate KGn(x) within a derivative-free simulation-based optimization method to optimize the KG acquisition function. However, optimizing noisy simulation-based functions without access to derivatives is challenging. Frazier et al. (2009) proposed discretizing A and calculating (10) exactly using properties of the normal distribution. This works well for low-dimensional problems but becomes computationally burdensome in higher dimensions.
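The simulation loop just described can be sketched as follows for a one-dimensional problem with noise-free evaluations and a discretized A. The squared-exponential kernel, its lengthscale, the zero prior mean, and the test function are illustrative assumptions on our part, not choices made in the text.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel matrix between 1-d point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def posterior(X, y, grid, jitter=1e-6):
    # GP posterior mean and covariance on `grid` given data (X, y), with a
    # zero prior mean; `jitter` stabilizes the solve for noise-free data.
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)
    cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

def kg_estimate(X, y, grid, x_idx, J=1000, seed=None):
    # Monte Carlo estimate of KG_n(grid[x_idx]): simulate y_{n+1} from the
    # time-n posterior at the candidate point, refit the posterior, and
    # average the increase in the maximum posterior mean over J replications.
    rng = np.random.default_rng(seed)
    mu_n, cov_n = posterior(X, y, grid)
    mu_star_n = mu_n.max()
    sd = np.sqrt(max(cov_n[x_idx, x_idx], 0.0))
    diffs = []
    for _ in range(J):
        y_new = mu_n[x_idx] + sd * rng.standard_normal()  # simulated y_{n+1}
        mu_np1, _ = posterior(np.append(X, grid[x_idx]),
                              np.append(y, y_new), grid)
        diffs.append(mu_np1.max() - mu_star_n)  # one draw of mu*_{n+1} - mu*_n
    return float(np.mean(diffs))
```

One would then pick the next evaluation as the grid point maximizing this estimate, which makes concrete why optimizing the acquisition function is itself a noisy derivative-free problem: each call to kg_estimate returns a random quantity whose noise shrinks only as J grows.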