Gaussian Process (GP) Regression


GP regression is a Bayesian statistical approach for modeling functions. We offer a brief introduction here. A more complete treatment may be found in Rasmussen and Williams (2006).
We first describe GP regression, focusing on f's values at a finite collection of points x1, ..., xk ∈ R^d. It is convenient to collect the function's values at these points together into a vector [f(x1), ..., f(xk)]. Whenever we have a quantity that is unknown in Bayesian statistics, like this vector, we suppose that it was drawn at random by nature from some prior probability distribution. GP regression takes this prior distribution to be multivariate normal, with a particular mean vector and covariance matrix.



Figure 1: Illustration of BayesOpt, maximizing an objective function f with a 1-dimensional continuous input. The top panel shows: noise-free observations of the objective function f at 3 points, in blue; an estimate of f(x) (solid red line); and Bayesian credible intervals (similar to confidence intervals) for f(x) (dashed red line). These estimates and credible intervals are obtained using GP regression. The bottom panel shows the acquisition function. Bayesian optimization chooses to sample next at the point that maximizes the acquisition function, indicated here with an “x.”




We construct the mean vector by evaluating a mean function µ0 at each xi. We construct the covariance matrix by evaluating a covariance function or kernel Σ0 at each pair of points xi, xj . The kernel is chosen so that points xi, xj that are closer in the input space have a large positive correlation, encoding the belief that they should have more similar function values than points that are far apart. The kernel should also have the property that the resulting covariance matrix is positive semi-definite, regardless of the collection of points chosen. Example mean functions and kernels are discussed below in Section 3.1.
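As a concrete sketch of this construction, the snippet below builds the mean vector and covariance matrix for a few one-dimensional points. The constant mean function, the squared-exponential kernel, and its hyperparameters are illustrative assumptions for this sketch only; Section 3.1 discusses the mean functions and kernels typically used.

```python
import numpy as np

# Illustrative assumptions for this sketch (not prescribed by the text):
# a constant mean function and a squared-exponential kernel with unit
# variance and length scale 0.5.
def mu0(x, constant=0.0):
    # Mean function evaluated at each point x_i.
    return np.full(len(x), constant)

def sigma0(x, x2, variance=1.0, length_scale=0.5):
    # Squared-exponential kernel: covariance is large and positive for nearby
    # points and decays toward zero as points move apart.
    diff = np.subtract.outer(x, x2)
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

x_1k = np.array([0.1, 0.4, 0.5, 0.9])   # the points x_1, ..., x_k (k = 4, d = 1)
mean_vector = mu0(x_1k)                  # mu0(x_{1:k})
cov_matrix = sigma0(x_1k, x_1k)          # Sigma0(x_{1:k}, x_{1:k})

# The resulting covariance matrix is symmetric positive semi-definite:
# all eigenvalues are (numerically) non-negative.
print(np.allclose(cov_matrix, cov_matrix.T))           # True
print(np.linalg.eigvalsh(cov_matrix).min() > -1e-12)   # True
```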
The resulting prior distribution on [f(x1), ..., f(xk)] is

f(x1:k) ∼ Normal(µ0(x1:k), Σ0(x1:k, x1:k)),   (2)

where we use compact notation for functions applied to collections of input points: x1:k indicates the sequence x1, ..., xk; f(x1:k) = [f(x1), ..., f(xk)]; µ0(x1:k) = [µ0(x1), ..., µ0(xk)]; and Σ0(x1:k, x1:k) = [Σ0(x1, x1), ..., Σ0(x1, xk); ...; Σ0(xk, x1), ..., Σ0(xk, xk)].
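Sampling from the prior (2) is then just a draw from a multivariate normal with this mean vector and covariance matrix; a minimal sketch, reusing the same illustrative kernel and zero constant mean as above:

```python
import numpy as np

# Same illustrative assumptions as in the previous sketch: zero constant mean,
# squared-exponential kernel with unit variance and length scale 0.5.
def sigma0(x, x2, variance=1.0, length_scale=0.5):
    diff = np.subtract.outer(x, x2)
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

x_1k = np.array([0.1, 0.4, 0.5, 0.9])
mean_vector = np.zeros(len(x_1k))   # mu0(x_{1:k}) under a zero constant mean
cov_matrix = sigma0(x_1k, x_1k)     # Sigma0(x_{1:k}, x_{1:k})

# Each row is one realization of [f(x_1), ..., f(x_k)] drawn from the prior (2).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean_vector, cov_matrix, size=3)
print(samples.shape)  # (3, 4)
```

Because x2 = 0.4 and x3 = 0.5 are close together, their sampled values are strongly correlated in every draw, which is exactly the similarity the kernel is chosen to encode.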
Suppose we observe f(x1:n) without noise for some n and we wish to infer the value of f(x) at some new point x. To do so, we let k = n + 1 and xk = x, so that the prior over [f(x1:n), f(x)] is given by (2). We may then compute the conditional distribution of f(x) given these observations using Bayes’ rule (see details in Chapter 2.1 of Rasmussen and Williams (2006)).
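Written out in the notation above, this conditional distribution is the standard noise-free GP posterior (following the derivation in Chapter 2.1 of Rasmussen and Williams (2006)); a LaTeX statement of it, with µn and σn² denoting the posterior mean and variance:

```latex
f(x) \mid f(x_{1:n}) \sim \mathrm{Normal}\left(\mu_n(x), \sigma_n^2(x)\right), \quad \text{where}

\mu_n(x) = \mu_0(x) + \Sigma_0(x, x_{1:n}) \, \Sigma_0(x_{1:n}, x_{1:n})^{-1} \left( f(x_{1:n}) - \mu_0(x_{1:n}) \right),

\sigma_n^2(x) = \Sigma_0(x, x) - \Sigma_0(x, x_{1:n}) \, \Sigma_0(x_{1:n}, x_{1:n})^{-1} \, \Sigma_0(x_{1:n}, x).
```

The posterior mean µn(x) and variance σn²(x) are the quantities behind the estimate of f(x) and the credible intervals plotted in Figure 1.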

