GP regression is a Bayesian statistical approach for modeling functions. We offer a brief introduction here. A more complete treatment may be found in Rasmussen and Williams (2006).
We first describe GP regression, focusing on f's values at a finite collection of points x1, ..., xk ∈ Rd. It is convenient to collect the function's values at these points together into a vector [f(x1), ..., f(xk)]. Whenever we have a quantity that is unknown in Bayesian statistics, like this vector, we suppose that it was drawn at random by nature from some prior probability distribution. GP regression takes this prior distribution to be multivariate normal, with a particular mean vector and covariance matrix.
Figure 1: Illustration of BayesOpt, maximizing an objective function f with a 1-dimensional continuous input. The top panel shows: noise-free observations of the objective function f at 3 points, in blue; an estimate of f(x) (solid red line); and Bayesian credible intervals (similar to confidence intervals) for f(x) (dashed red line). These estimates and credible intervals are obtained using GP regression. The bottom panel shows the acquisition function. Bayesian optimization chooses to sample next at the point that maximizes the acquisition function, indicated here with an "x."
We construct the mean vector by evaluating a mean function µ0 at each xi. We construct the covariance matrix by evaluating a covariance function or kernel Σ0 at each pair of points xi, xj. The kernel is chosen so that points xi, xj that are closer in the input space have a large positive correlation, encoding the belief that they should have more similar function values than points that are far apart. The kernel should also have the property that the resulting covariance matrix is positive semi-definite, regardless of the collection of points chosen. Example mean functions and kernels are discussed below in Section 3.1.
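As a small illustration of these properties, the following sketch builds a covariance matrix from a squared-exponential kernel (one common choice; the hyperparameter values here are purely illustrative, not ones from the text) and checks that nearby points are more correlated and that the matrix is positive semi-definite:

```python
import numpy as np

def squared_exponential_kernel(xi, xj, alpha0=1.0, alpha=1.0):
    """Correlation decays with squared distance between inputs, so
    nearby points are modeled as having similar function values.
    alpha0 and alpha are hypothetical hyperparameters for illustration."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return alpha0 * np.exp(-alpha * np.sum(diff ** 2))

# Covariance matrix Sigma0 for three points in R^1.
points = [0.0, 0.5, 2.0]
Sigma0 = np.array([[squared_exponential_kernel(a, b) for b in points]
                   for a in points])

# Nearby points (0.0 and 0.5) get a larger covariance entry than
# distant ones (0.0 and 2.0), and all eigenvalues are nonnegative,
# so the matrix is positive semi-definite.
```

The positive semi-definiteness check generalizes: for a valid kernel it holds for any collection of points, which is what makes the multivariate normal prior well defined.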
The resulting prior distribution on [f(x1), ..., f(xk)] is

f(x1:k) ∼ Normal(µ0(x1:k), Σ0(x1:k, x1:k)),  (2)

where we use compact notation for functions applied to collections of input points: x1:k indicates the sequence x1, ..., xk; f(x1:k) = [f(x1), ..., f(xk)]; µ0(x1:k) = [µ0(x1), ..., µ0(xk)]; and Σ0(x1:k, x1:k) =
[Σ0(x1, x1), ..., Σ0(x1, xk); ...; Σ0(xk, x1), ..., Σ0(xk, xk)].
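To make the prior in (2) concrete, the sketch below evaluates µ0(x1:k) and Σ0(x1:k, x1:k) at a grid of points and draws sample paths from the resulting multivariate normal. The zero mean function, squared-exponential kernel, and hyperparameter values are illustrative assumptions, not choices made in the text:

```python
import numpy as np

def mu0(x):
    return 0.0  # assumed constant-zero prior mean

def kernel(xi, xj, alpha0=1.0, alpha=1.0):
    # Assumed squared-exponential kernel with illustrative hyperparameters.
    return alpha0 * np.exp(-alpha * (xi - xj) ** 2)

# Evaluate the prior (2) at k points x_{1:k} on a 1-d grid.
k = 50
x = np.linspace(0.0, 5.0, k)
mean = np.array([mu0(xi) for xi in x])                  # mu0(x_{1:k})
cov = np.array([[kernel(a, b) for b in x] for a in x])  # Sigma0(x_{1:k}, x_{1:k})

# A tiny "jitter" on the diagonal keeps the matrix numerically PSD.
cov += 1e-10 * np.eye(k)

# f(x_{1:k}) ~ Normal(mean, cov): draw three sample paths from the prior.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=3)  # shape (3, k)
```

Each row of `samples` is one plausible function under the prior: smooth because nearby grid points are strongly correlated under the kernel.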
Suppose we observe f(x1:n) without noise for some n, and we wish to infer the value of f(x) at some new point x. To do so, we let k = n + 1 and xk = x, so that the prior over [f(x1:n), f(x)] is given by (2). We may then compute the conditional distribution of f(x) given these observations using Bayes' rule (see details in Chapter 2.1 of Rasmussen and Williams (2006)),