The knowledge-gradient acquisition function is derived by revisiting the assumption made in EI’s derivation that we are only willing to return a previously evaluated point as our final solution. That assumption is reasonable when evaluations are noise-free and we are highly risk averse, but if the decision-maker is willing to tolerate some risk then she might be willing to report a final solution that has some uncertainty attached to it. Moreover, if evaluations have noise (discussed below in Section 5) then the final solution reported will necessarily have uncertain value because we can hardly evaluate it an infinite number of times.
We replace this assumption by allowing the decision-maker to return any solution she likes, even if it has not been previously evaluated. We also assume risk-neutrality (Berger, 2013), i.e., we value a random outcome X according to its expected value. The solution that we would choose if we were to stop
sampling after n samples would be the one with the largest µn(x) value. This solution (call it x̂∗, since it approximates the global optimum x∗) would have value f(x̂∗). This value is random under the posterior, and has conditional expected value µn(x̂∗) = maxx′ µn(x′) =: µ∗n.
On the other hand, if we were to take one more sample at x, we would obtain a new posterior
distribution with posterior mean µn+1(·). This posterior mean would be computed via (3), but including the additional observation (xn+1, yn+1). If we were to report a final solution after this sample, its expected value under the new posterior distribution would be µ∗n+1 := maxx′ µn+1(x′). Thus, the increase in conditional expected solution value due to sampling is µ∗n+1 − µ∗n.
While this quantity is unknown before we sample at xn+1, we can compute its expected value given the observations at x1, . . . , xn that we have. We call this quantity the knowledge gradient (KG) for measuring at x,
KGn(x) := En[µ∗n+1 − µ∗n | xn+1 = x] . (10)
Using the knowledge gradient as our acquisition function then leads us to sample at the point with largest KGn(x), argmaxx KGn(x).

This algorithm was first proposed in Frazier et al. (2009) for GP regression over discrete A, building
on earlier work (Frazier et al., 2008) that proposed the same algorithm for Bayesian ranking and selection (Chick and Inoue, 2001) with an independent prior. (Bayesian ranking and selection is similar to Bayesian optimization, except that A is discrete and finite, observations are necessarily noisy, and the prior is typically independent across x.)
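In the independent-prior ranking-and-selection setting just mentioned, the expectation in (10) can be evaluated in closed form using properties of the normal distribution: sampling alternative i moves only that alternative’s posterior mean, so µ∗n+1 = max(µn+1(i), c), where c is the best competing posterior mean. The following sketch illustrates that computation; the function name, interface, and two-alternative example are our own illustration, not code from the cited papers.

```python
from math import sqrt
from statistics import NormalDist

def kg_independent(mu, sigma, lam, i):
    # Exact KG for alternative i under an independent normal posterior:
    # mu[j], sigma[j] are the posterior mean and standard deviation of f at
    # alternative j, and lam is the observation-noise variance.
    nd = NormalDist()
    # Standard deviation, seen from time n, of the updated mean mu_{n+1}(i):
    # the reduction in posterior variance from one noisy sample at i.
    s = sigma[i] ** 2 / sqrt(sigma[i] ** 2 + lam)
    c = max(m for j, m in enumerate(mu) if j != i)  # best competing mean
    if s == 0.0:
        return 0.0  # no learning possible, so no expected improvement
    d = (c - mu[i]) / s
    # E[max(mu_i + s Z, c)] = c + s * (phi(d) - d * (1 - Phi(d))), Z ~ N(0,1)
    e_max = c + s * (nd.pdf(d) - d * (1.0 - nd.cdf(d)))
    return e_max - max(mu[i], c)
```

For example, with two alternatives having means 1.0 and 0.5, unit posterior standard deviations, and unit noise variance, the KG of sampling the trailing alternative is a small positive number: the sample can only help the final decision in expectation, consistent with KGn(x) ≥ 0.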
The conceptually simplest way to compute the KG acquisition function is via simulation, as shown in Algorithm 2. Within a loop, this algorithm simulates one possible value for the observation yn+1 that may result from taking evaluation n+1 at a designated x. Then it computes what the maximum of the new posterior mean µ∗n+1 would be if that value for yn+1 were the one that actually resulted from the measurement. It then subtracts µ∗n to obtain the corresponding increase in solution quality. This comprises one loop of the algorithm. It iterates this loop many (J) times and averages the differences µ∗n+1 − µ∗n obtained from different simulated values for yn+1 to estimate the KG acquisition function KGn(x). As J grows large, this estimate converges to KGn(x).

In principle, this algorithm can be used to evaluate KGn(x) within a derivative-free simulation-based optimization method to optimize the KG acquisition function. However, optimizing noisy simulation-based functions without access to derivatives is challenging. Frazier et al. (2009) proposed discretizing A and calculating (10) exactly using properties of the normal distribution. This works well for low-dimensional problems but becomes computationally burdensome in higher dimensions.
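The simulation loop just described can be sketched as follows for a one-dimensional problem with noise-free evaluations and a discretized A. The squared-exponential kernel, its lengthscale, the zero prior mean, and the test function are illustrative assumptions on our part, not choices made in the text.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel matrix between 1-d point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def posterior(X, y, grid, jitter=1e-6):
    # GP posterior mean and covariance on `grid` given data (X, y), with a
    # zero prior mean; `jitter` stabilizes the solve for noise-free data.
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)
    cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

def kg_estimate(X, y, grid, x_idx, J=1000, seed=None):
    # Monte Carlo estimate of KG_n(grid[x_idx]): simulate y_{n+1} from the
    # time-n posterior at the candidate point, refit the posterior, and
    # average the increase in the maximum posterior mean over J replications.
    rng = np.random.default_rng(seed)
    mu_n, cov_n = posterior(X, y, grid)
    mu_star_n = mu_n.max()
    sd = np.sqrt(max(cov_n[x_idx, x_idx], 0.0))
    diffs = []
    for _ in range(J):
        y_new = mu_n[x_idx] + sd * rng.standard_normal()  # simulated y_{n+1}
        mu_np1, _ = posterior(np.append(X, grid[x_idx]),
                              np.append(y, y_new), grid)
        diffs.append(mu_np1.max() - mu_star_n)  # one draw of mu*_{n+1} - mu*_n
    return float(np.mean(diffs))
```

One would then pick the next evaluation as the grid point maximizing this estimate, which makes concrete why optimizing the acquisition function is itself a noisy derivative-free problem: each call to kg_estimate returns a random quantity whose noise shrinks only as J grows.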