Algorithm 3 Optimization of the KG acquisition function via multistart stochastic gradient ascent, with R restarts and T stochastic gradient iterations per restart.




for r = 1 to R do
    Choose x_0^{(r)} uniformly at random from A.
    for t = 1 to T do
        Let G be the stochastic gradient estimate of ∇KGn(x_{t−1}^{(r)}) from Algorithm 4.
        Let αt = a/(a + t).
        Let x_t^{(r)} = x_{t−1}^{(r)} + αt G.
    end for
    Estimate KGn(x_T^{(r)}) using Algorithm 2 and J replications.
end for
Return the x_T^{(r)} with the largest estimated value of KGn(x_T^{(r)}).
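To make the control flow concrete, here is a minimal Python sketch of this multistart procedure. The subroutines stochastic_grad_kg (standing in for Algorithm 4) and estimate_kg (standing in for Algorithm 2) are hypothetical callables, the domain A is assumed to be a hyperrectangle, and the clipping step that keeps iterates inside A is an implementation detail not spelled out in the pseudocode above.

import numpy as np

def maximize_kg(stochastic_grad_kg, estimate_kg, bounds,
                R=10, T=100, a=4.0, J=64, seed=0):
    """Multistart stochastic gradient ascent over the KG surface."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds                        # corners of the hyperrectangle A
    best_x, best_val = None, -np.inf
    for _ in range(R):
        x = rng.uniform(lo, hi)            # x_0^(r) uniform over A
        for t in range(1, T + 1):
            G = stochastic_grad_kg(x)      # Algorithm 4: unbiased gradient estimate
            alpha = a / (a + t)            # diminishing step size alpha_t
            x = np.clip(x + alpha * G, lo, hi)  # ascend, then project onto A
        val = estimate_kg(x, J)            # Algorithm 2 with J replications
        if val > best_val:
            best_x, best_val = x, val
    return best_x

Note that the step sizes αt = a/(a + t) sum to infinity while their squares are summable, the standard stochastic approximation conditions under which such iterations converge.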

The stochastic gradient G used by the inner loop of Algorithm 3 is calculated via Algorithm 4. This algorithm is based on the idea that we can exchange gradient and expectation (under sufficient regularity conditions) to write

∇KGn(x) = ∇En [µ*n+1 − µ*n | xn+1 = x] = En [∇µ*n+1 | xn+1 = x] ,

where we have noted that µ*n does not depend on x. This approach is called infinitesimal perturbation analysis (Ho et al., 1983). Thus, to construct a stochastic gradient it is sufficient to sample ∇µ*n+1. In other words, imagine first sampling Z in the inner loop in Algorithm 2, and then holding Z fixed while calculating the gradient of µ*n+1 with respect to x. To calculate this gradient, see that µ*n+1 is a maximum over x′ of µn+1(x′; x, yn+1) = µn+1(x′; x, µn(x) + σn(x)Z). This is a maximum over a collection of functions of x. The envelope theorem (Milgrom and Segal, 2002) tells us (under sufficient regularity conditions) that the gradient with respect to x of a maximum of a collection of functions of x is given simply by first finding the maximum in this collection, and then differentiating this single function with respect to x. In our setting, we apply this by letting x̂ be the x′ maximizing µn+1(x′; x, µn(x) + σn(x)Z), and then calculating the gradient of µn+1(x̂; x, µn(x) + σn(x)Z) with respect to x while holding x̂ fixed. In other words,

∇ max_{x′} µn+1(x′; x, µn(x) + σn(x)Z) = ∇µn+1(x̂; x, µn(x) + σn(x)Z),

where we remind the reader that ∇ refers to taking the gradient with respect to x, here and throughout. This is summarized in Algorithm 4.
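As a quick numerical sanity check of the envelope-theorem step, the sketch below compares the two sides of the display above on an arbitrary smooth test function (the specific g is invented for this check; it is not the KG posterior mean): differentiating only the second argument at the frozen inner maximizer matches a direct finite-difference derivative of the max-value function.

import numpy as np

def g(xp, x):
    """An arbitrary smooth test function of (x', x), invented for this check."""
    return -0.5 * (xp - x) ** 2 + np.sin(3.0 * xp)

xp_grid = np.linspace(-3.0, 3.0, 20001)   # fine grid for the inner maximization

def v(x):
    """v(x) = max over x' of g(x', x), computed on the grid."""
    return g(xp_grid, x).max()

x = 0.7
h = 1e-3
x_hat = xp_grid[np.argmax(g(xp_grid, x))]  # inner maximizer, then held fixed

# Envelope theorem: differentiate g(x_hat, .) only, with x_hat frozen.
envelope = (g(x_hat, x + h) - g(x_hat, x - h)) / (2 * h)
# Direct finite difference of the max-value function itself.
direct = (v(x + h) - v(x - h)) / (2 * h)
print(envelope, direct)   # the two derivative estimates agree closely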



Algorithm 4 Simulation of unbiased stochastic gradients G with E[G] = ∇KGn(x). These stochastic gradients can then be used within stochastic gradient ascent to optimize the KG acquisition function.

for j = 1 to J do
    Generate Z ∼ Normal(0, 1).
    Let yn+1 = µn(x) + σn(x)Z.
    Let µn+1(x′; x, yn+1) = µn+1(x′; x, µn(x) + σn(x)Z) be the posterior mean at x′ computed via (3) with (x, yn+1) as the last observation.
    Solve max_{x′} µn+1(x′; x, yn+1), e.g., using L-BFGS. Let x̂ be the maximizing x′.
    Let G(j) be the gradient of µn+1(x̂; x, µn(x) + σn(x)Z) with respect to x, holding x̂ fixed.
end for
Estimate ∇KGn(x) by G = (1/J) Σ_{j=1}^{J} G(j).
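Below is a minimal, self-contained sketch of this estimator for a one-dimensional GP with an RBF kernel. It departs from the text in two labeled places: the inner maximization uses a fixed grid rather than L-BFGS, and the gradient with respect to x (holding x̂ and Z fixed) is approximated by central finite differences rather than derived analytically from (3). The kernel, toy data, and all names are invented for the example.

import numpy as np

def rbf(a, b, ls=0.25):
    """RBF kernel matrix between two 1-d point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

# Toy data (x_i, y_i), i = 1..n, with a small observation-noise variance.
X = np.array([0.1, 0.4, 0.9]); Y = np.sin(5 * X); noise = 1e-4

def mean_sd(q, Xd, Yd):
    """Posterior mean and standard deviation at query points q given (Xd, Yd)."""
    K = rbf(Xd, Xd) + noise * np.eye(len(Xd))
    Ks = rbf(q, Xd)
    m = Ks @ np.linalg.solve(K, Yd)
    v = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return m, np.sqrt(np.maximum(v, 1e-12))

def stochastic_grad_kg(x, J=32, h=1e-4, seed=0):
    grid = np.linspace(0.0, 1.0, 501)      # grid stand-in for the L-BFGS solve
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(J):
        Z = rng.standard_normal()          # Z is sampled once, then held fixed

        def mu_next(xq):                   # posterior mean on the grid after
            m, s = mean_sd(np.array([xq]), X, Y)         # fantasizing at xq
            y = m[0] + s[0] * Z                          # y_{n+1} as a function of xq
            m1, _ = mean_sd(grid, np.append(X, xq), np.append(Y, y))
            return m1

        x_hat = np.argmax(mu_next(x))      # inner maximizer, held fixed below
        grads.append((mu_next(x + h)[x_hat] - mu_next(x - h)[x_hat]) / (2 * h))
    return np.mean(grads)                  # G = (1/J) sum_j G(j)

print(stochastic_grad_kg(0.6))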

Unlike expected improvement, which only considers the posterior at the point sampled, the KG acquisition function considers the posterior over f 's full domain, and how the sample will change that posterior. KG places a positive value on measurements that cause the maximum of the posterior mean to improve, even if the value at the sampled point is not better than the previous best point. This provides a small performance benefit in the standard BayesOpt problem with noise-free evaluations (Frazier et al., 2009), and provides substantial performance improvements in problems with noise, multi-fidelity observations, derivative observations, the need to integrate over environmental conditions, and other more exotic problem features (Section 5). In these alternate problems, the value of sampling comes not through an improvement in the best solution at the sampled point, but through an improvement in the maximum of the posterior mean across feasible solutions. For example, a derivative observation may show that the function is increasing in a particular direction in the vicinity of the sampled point. This may cause the maximum of the posterior mean to be substantially larger than the previous maximum, even if the function value at the sampled point is worse than the best previously sampled point. When such phenomena are first-order, KG tends to significantly outperform EI (Wu et al., 2017; Poloczek et al., 2017; Wu and Frazier, 2016; Toscano-Palmerin and Frazier, 2018).


    1. Entropy Search and Predictive Entropy Search



The entropy search (ES) (Hennig and Schuler, 2012) acquisition function values the information we have about the location of the global maximum according to its differential entropy. ES seeks the point to evaluate that causes the largest decrease in differential entropy. (Recall from, e.g., Cover and Thomas (2012), that the differential entropy of a continuous probability distribution p(x) is −∫ p(x) log(p(x)) dx, and that smaller differential entropy indicates less uncertainty.) Predictive entropy search (PES) (Hernández-Lobato et al., 2014) seeks the same point, but uses a reformulation of the entropy reduction objective based on mutual information. Exact calculations of PES and ES would give equivalent acquisition functions, but exact calculation is not typically possible, and so the difference in computational techniques used to approximate the PES and ES acquisition functions creates practical differences in the sampling decisions that result from the two approaches. We first discuss ES and then PES.
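As a small illustration of this definition, the following sketch checks the numerical integral −∫ p log p dx against the known Gaussian closed form (1/2) log(2πeσ²); the specific standard deviation is arbitrary.

import numpy as np

s = 0.3                                     # an arbitrary standard deviation
xs = np.linspace(-5.0, 5.0, 200001)
p = np.exp(-0.5 * (xs / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
numeric = -np.trapz(p * np.log(np.maximum(p, 1e-300)), xs)  # -int p log p dx
closed_form = 0.5 * np.log(2.0 * np.pi * np.e * s ** 2)
print(numeric, closed_form)  # agree; larger s gives larger differential entropy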
Let x* be the global optimum of f . The posterior distribution on f at time n induces a probability distribution for x*. Indeed, if the domain A were finite, then we could represent f over its domain by a vector (f (x) : x ∈ A), and x* would correspond to the largest element in this vector. The distribution of this vector under the time-n posterior distribution would be multivariate normal, and this multivariate normal distribution would imply the distribution of x*. When A is continuous, the same ideas apply, where x* is a random variable whose distribution is implied by the Gaussian process posterior on f .
With this understanding, we represent the entropy of the time-n posterior distribution on x* with the notation H(Pn(x*)). Similarly, H(Pn(x* | x, f (x))) represents the entropy of what the time-(n + 1) posterior distribution on x* will be if we observe at x and see f (x). This quantity depends on the value of f (x) observed. Then, the entropy reduction due to sampling x can be written

ESn(x) = H(Pn(x*)) − Ef(x) [H(Pn(x* | f (x)))] . (11)

In the second term, the subscript in the outer expectation indicates that we take the expectation over f (x). Equivalently, this term can be written ∫ ϕ(y; µn(x), σn²(x)) H(Pn(x* | f (x) = y)) dy, where ϕ(y; µn(x), σn²(x)) is the normal density with mean µn(x) and variance σn²(x).
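On a discretized domain, the quantity in (11) can be estimated directly by Monte Carlo: estimate Pn(x*) by sampling posterior paths of f on a grid and recording the argmax, then average the post-observation entropy over fantasized values y ∼ Normal(µn(x), σn²(x)). The sketch below does exactly this, using the discrete entropy of the argmax distribution as the grid analogue of the differential entropy; the kernel, toy data, and all sample sizes are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, ls=0.25):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X = np.array([0.1, 0.4, 0.9]); Y = np.sin(5 * X); noise = 1e-4

def posterior(Xd, Yd, q):
    """Posterior mean vector and covariance matrix of f at query points q."""
    K = rbf(Xd, Xd) + noise * np.eye(len(Xd))
    Ks = rbf(q, Xd)
    m = Ks @ np.linalg.solve(K, Yd)
    C = rbf(q, q) - Ks @ np.linalg.solve(K, Ks.T)
    return m, C

def entropy_of_argmax(m, C, n_paths=2000):
    """Discrete entropy of the argmax of GP paths sampled on the grid."""
    paths = rng.multivariate_normal(m, C + 1e-8 * np.eye(len(m)), n_paths,
                                    check_valid="ignore")
    p = np.bincount(paths.argmax(axis=1), minlength=len(m)) / n_paths
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def es(x, grid, n_fantasy=30):
    m, C = posterior(X, Y, grid)
    h_now = entropy_of_argmax(m, C)                 # H(P_n(x*))
    mx, Cx = posterior(X, Y, np.array([x]))
    h_later = 0.0
    for _ in range(n_fantasy):                      # fantasize y ~ N(mu_n, sd_n^2)
        y = mx[0] + np.sqrt(Cx[0, 0] + noise) * rng.standard_normal()
        m1, C1 = posterior(np.append(X, x), np.append(Y, y), grid)
        h_later += entropy_of_argmax(m1, C1) / n_fantasy
    return h_now - h_later                          # estimate of (11)

grid = np.linspace(0.0, 1.0, 101)
print(es(0.6, grid))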
Like KG, ES and PES below are influenced by how the measurement changes the posterior over the whole domain, and not just by whether it improves over an incumbent solution at the point sampled. This is useful when deciding where to sample in exotic problems, and it is here that ES and PES can provide substantial value relative to EI.
While ES can be computed and optimized approximately (Hennig and Schuler, 2012), doing so is challenging because (a) the entropy of the maximizer of a Gaussian process is not available in closed form; (b) we must calculate this entropy for a large number of y to approximate the expectation in (11); and (c) we must then optimize this hard-to-evaluate function. Unlike KG, there is no known method for computing stochastic gradients that would simplify this optimization.
PES offers an alternate approach for computing (11). This approach notes that the reduction in the entropy of x* due to measuring f (x) is equal to the mutual information between f (x) and x*, which is in turn equal to the reduction in the entropy of f (x) due to measuring x*. This equivalence gives the expression

PESn(x) = ESn(x) = H(Pn(f (x))) − Ex* [H(Pn(f (x) | x*))] . (12)

Here, the subscript in the expectation in the second term indicates that the expectation is taken over x*. Unlike ES, the first term in the PES acquisition function, H(Pn(f (x))), can be computed in closed form. The second term must still be approximated: Hernández-Lobato et al. (2014) provides a method for sampling x* from the posterior distribution, and a method for approximating H(Pn(f (x) | x*)) using expectation propagation (Minka, 2001). The resulting approximate acquisition function may then be optimized by a method for derivative-free optimization via simulation.
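The identity behind (12), that the entropy reduction (mutual information) can be computed in either direction, can be checked numerically on a discretized toy problem. The sketch below samples GP prior paths on a grid (the prior is used purely for simplicity; nothing here implements the expectation-propagation machinery of PES), forms the empirical joint distribution of the argmax x* and a binned value of f (x), and evaluates both entropy-reduction decompositions; up to floating-point error they coincide.

import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, ls=0.25):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

grid = np.linspace(0.0, 1.0, 51)
x_idx = 30                         # the candidate x is taken to be a grid point
C = rbf(grid, grid) + 1e-8 * np.eye(len(grid))
paths = rng.multivariate_normal(np.zeros(len(grid)), C, 50000,
                                check_valid="ignore")
a = paths.argmax(axis=1)                                   # samples of x*
b = np.digitize(paths[:, x_idx], np.linspace(-3, 3, 24))   # binned f(x)

def H(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Empirical joint pmf of (x*, binned f(x)).
joint, _, _ = np.histogram2d(a, b, bins=[np.arange(len(grid) + 1) - 0.5,
                                         np.arange(26) - 0.5])
joint /= joint.sum()
pa, pb = joint.sum(axis=1), joint.sum(axis=0)

# I = H(x*) - E_y[H(x* | y)]  and  I = H(y) - E_{x*}[H(y | x*)].
i1 = H(pa) - sum(pb[j] * H(joint[:, j] / pb[j]) for j in range(len(pb)) if pb[j] > 0)
i2 = H(pb) - sum(pa[i] * H(joint[i, :] / pa[i]) for i in range(len(pa)) if pa[i] > 0)
print(i1, i2)   # the two decompositions agree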
