Knowledge Gradient


The knowledge-gradient acquisition function is derived by revisiting the assumption made in EI's derivation that we are only willing to return a previously evaluated point as our final solution. That assumption is reasonable when evaluations are noise-free and we are highly risk averse, but if the decision-maker is willing to tolerate some risk then she might be willing to report a final solution that has some uncertainty attached to it. Moreover, if evaluations have noise (discussed below in Section 5), then the final solution reported will necessarily have uncertain value, because we can hardly evaluate it an infinite number of times.

We replace this assumption by allowing the decision-maker to return any solution she likes, even if it has not been previously evaluated. We also assume risk-neutrality (Berger, 2013), i.e., we value a random outcome X according to its expected value. The solution that we would choose if we were to stop sampling after n samples would be the one with the largest µ_n(x) value. This solution (call it x̂*, since it approximates the global optimum x*) would have value f(x̂*). f(x̂*) is random under the posterior, and has conditional expected value µ_n(x̂*) = max_{x'} µ_n(x') =: µ*_n.
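To make x̂* and µ*_n concrete, the sketch below fits a Gaussian-process surrogate to a handful of noise-free evaluations and maximizes its posterior mean with multi-start L-BFGS-B (the kind of nonlinear optimizer suggested in the note to Algorithm 2 below). The toy objective, the scikit-learn surrogate, and the helper names posterior_mean and max_posterior_mean are illustrative assumptions, not part of the tutorial.

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)

    # Hypothetical 1-D objective and n = 5 noise-free evaluations (illustration only).
    f = lambda x: -np.sin(3 * x) - x ** 2 + 0.7 * x
    X = rng.uniform(-1.0, 2.0, size=(5, 1))
    y = f(X).ravel()

    # GP surrogate playing the role of the posterior mean/variance in equation (3).
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6).fit(X, y)

    def posterior_mean(x):
        # mu_n(x) for a single point x of shape (d,)
        return gp.predict(x.reshape(1, -1))[0]

    def max_posterior_mean(bounds, restarts=20):
        # mu*_n = max_{x'} mu_n(x'), approximated by multi-start L-BFGS-B
        best_x, best_val = None, -np.inf
        for x0 in rng.uniform(bounds[:, 0], bounds[:, 1], size=(restarts, bounds.shape[0])):
            res = minimize(lambda z: -posterior_mean(z), x0, bounds=bounds, method="L-BFGS-B")
            if -res.fun > best_val:
                best_x, best_val = res.x, -res.fun
        return best_x, best_val

    bounds = np.array([[-1.0, 2.0]])
    x_hat, mu_star_n = max_posterior_mean(bounds)   # x_hat approximates x*; mu_star_n is mu*_n

In practice one would reuse whatever surrogate and inner optimizer the rest of the Bayesian optimization loop already maintains; the point here is only that µ*_n is itself the value of an optimization problem over the posterior mean.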
On the other hand, if we were to take one more sample at x, we would obtain a new posterior distribution with posterior mean µ_{n+1}(·). This posterior mean would be computed via (3), but including the additional observation x_{n+1}, y_{n+1}. If we were to report a final solution after this sample, its expected value under the new posterior distribution would be µ*_{n+1} := max_{x'} µ_{n+1}(x'). Thus, the increase in conditional expected solution value due to sampling is µ*_{n+1} − µ*_n.

While this quantity is unknown before we sample at x_{n+1}, we can compute its expected value given the observations at x_1, . . . , x_n that we have. We call this quantity the knowledge gradient (KG) for measuring at x,

KG_n(x) := E_n [ µ*_{n+1} − µ*_n | x_{n+1} = x ] . (10)

Using the knowledge gradient as our acquisition function then leads us to sample at the point with largest KG_n(x), argmax_x KG_n(x).
This algorithm was first proposed in Frazier et al. (2009) for GP regression over discrete A, building
on earlier work (Frazier et al., 2008) that proposed the same algorithm for Bayesian ranking and selection (Chick and Inoue, 2001) with an independent prior. (Bayesian ranking and selection is similar to Bayesian optimization, except that A is discrete and finite, observations are necessarily noisy, and the prior is typically independent across x.)
The conceptually simplest way to compute the KG acquisition function is via simulation, as shown in Algorithm 2. Within a loop, this algorithm simulates one possible value for the observation y_{n+1} that may result from taking evaluation n + 1 at a designated x. Then it computes what the maximum of the new posterior mean µ_{n+1}, i.e. µ*_{n+1}, would be if that value for y_{n+1} were the one that actually resulted from the measurement. It then subtracts µ*_n to obtain the corresponding increase in solution quality. This comprises one loop of the algorithm. It iterates this loop many (J) times and averages the differences µ*_{n+1} − µ*_n obtained from the different simulated values for y_{n+1} to estimate the KG acquisition function KG_n(x). As J grows large, this estimate converges to KG_n(x).
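Before the formal statement of Algorithm 2 below, here is a minimal Python sketch of this simulation loop. It reuses the gp, X, y, bounds, rng, and max_posterior_mean objects assumed in the earlier sketch; refitting the surrogate on the augmented data and maximizing its mean is one straightforward (if slow) way to obtain µ*_{n+1}, not the only one.

    def knowledge_gradient_mc(x, J=200):
        # Monte Carlo estimate of KG_n(x), mirroring the loop described above.
        x = np.atleast_2d(x)
        mu_star_n = max_posterior_mean(bounds)[1]            # mu*_n under the current posterior
        mu_x, sigma_x = gp.predict(x, return_std=True)       # posterior mean/std at the candidate x
        deltas = []
        for _ in range(J):
            # Simulate one possible (noise-free) observation y_{n+1} at x from the current posterior.
            y_sim = rng.normal(mu_x[0], sigma_x[0])
            # Refit the surrogate on the augmented data (hyperparameters held fixed) to get mu_{n+1}(.).
            gp_new = GaussianProcessRegressor(kernel=gp.kernel_, optimizer=None, alpha=1e-6)
            gp_new.fit(np.vstack([X, x]), np.append(y, y_sim))
            # mu*_{n+1}: maximize the new posterior mean with multi-start L-BFGS-B.
            best = -np.inf
            for z0 in rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, bounds.shape[0])):
                res = minimize(lambda z: -gp_new.predict(z.reshape(1, -1))[0],
                               z0, bounds=bounds, method="L-BFGS-B")
                best = max(best, -res.fun)
            deltas.append(best - mu_star_n)                  # one simulated mu*_{n+1} - mu*_n
        return float(np.mean(deltas))                        # converges to KG_n(x) as J grows

The refit uses optimizer=None so that only the data change between µ_n and µ_{n+1}, not the kernel hyperparameters; that is a modeling choice for the sketch, not a requirement of the KG definition.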
In principle, this algorithm can be used to evaluate KG_n(x) within a derivative-free simulation-based optimization method to optimize the KG acquisition function. However, optimizing noisy simulation-based functions without access to derivatives is challenging. Frazier et al. (2009) proposed discretizing A and calculating (10) exactly using properties of the normal distribution. This works well for low-dimensional problems but becomes computationally burdensome in higher dimensions.



Algorithm 2 Simulation-based computation of the knowledge-gradient factor KG_n(x). (When calculating µ*_n and µ*_{n+1} below, use a nonlinear optimization method like L-BFGS.)

Let µ*_n = max_{x'} µ_n(x').
for j = 1, . . . , J:
    Simulate y_{n+1} ~ Normal(µ_n(x), σ_n(x)²) and compute the resulting posterior mean µ_{n+1}(·) from x_{1:n+1}, y_{1:n+1} with x_{n+1} = x.
    Let ∆_j = max_{x'} µ_{n+1}(x') − µ*_n.
Estimate KG_n(x) by (1/J) Σ_j ∆_j.
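In the spirit of the discretization idea mentioned above, though substituting the Monte Carlo estimator from the sketch for the exact normal-distribution computation of Frazier et al. (2009), one could pick the next evaluation point by scoring a finite grid of candidates; this continues the same illustrative assumptions as the earlier sketches.

    # Discretize A (here a 1-D grid) and sample next where the estimated KG is largest.
    candidates = np.linspace(bounds[0, 0], bounds[0, 1], 50)
    kg_values = np.array([knowledge_gradient_mc(c, J=200) for c in candidates])
    x_next = candidates[int(np.argmax(kg_values))]
    print(f"x_(n+1) = {x_next:.3f}, estimated KG_n = {kg_values.max():.4f}")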


