GP regression is a Bayesian statistical approach for modeling functions. We offer a brief introduction here. A more complete treatment may be found in Rasmussen and Williams (2006).
We first describe GP regression, focusing on f's values at a finite collection of points x1, ..., xk ∈ Rd. It is convenient to collect the function's values at these points together into a vector [f(x1), ..., f(xk)]. Whenever we have a quantity that is unknown in Bayesian statistics, like this vector, we suppose that it was drawn at random by nature from some prior probability distribution. GP regression takes this prior distribution to be multivariate normal, with a particular mean vector and covariance matrix.
Figure 1: Illustration of BayesOpt, maximizing an objective function f with a 1-dimensional continuous input. The top panel shows: noise-free observations of the objective function f at 3 points, in blue; an estimate of f(x) (solid red line); and Bayesian credible intervals (similar to confidence intervals) for f(x) (dashed red line). These estimates and credible intervals are obtained using GP regression. The bottom panel shows the acquisition function. Bayesian optimization chooses to sample next at the point that maximizes the acquisition function, indicated here with an "x."
We construct the mean vector by evaluating a mean function µ0 at each xi. We construct the covariance matrix by evaluating a covariance function or kernel Σ0 at each pair of points xi, xj. The kernel is chosen so that points xi, xj that are closer in the input space have a large positive correlation, encoding the belief that they should have more similar function values than points that are far apart. The kernel should also have the property that the resulting covariance matrix is positive semi-definite, regardless of the collection of points chosen. Example mean functions and kernels are discussed below in Section 3.1.
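As a small illustration of these properties, the following sketch builds a covariance matrix from a squared-exponential kernel (one common choice; the hyperparameter values here are purely illustrative, not ones from the text) and checks that nearby points are more correlated and that the matrix is positive semi-definite:

```python
import numpy as np

def squared_exponential_kernel(xi, xj, alpha0=1.0, alpha=1.0):
    """Correlation decays with squared distance between inputs, so
    nearby points are modeled as having similar function values.
    alpha0 and alpha are hypothetical hyperparameters for illustration."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return alpha0 * np.exp(-alpha * np.sum(diff ** 2))

# Covariance matrix Sigma0 for three points in R^1.
points = [0.0, 0.5, 2.0]
Sigma0 = np.array([[squared_exponential_kernel(a, b) for b in points]
                   for a in points])

# Nearby points (0.0 and 0.5) get a larger covariance entry than
# distant ones (0.0 and 2.0), and all eigenvalues are nonnegative,
# so the matrix is positive semi-definite.
```

The positive semi-definiteness check generalizes: for a valid kernel it holds for any collection of points, which is what makes the multivariate normal prior well defined.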
The resulting prior distribution on [f(x1), ..., f(xk)] is

f(x1:k) ∼ Normal(µ0(x1:k), Σ0(x1:k, x1:k)),  (2)

where we use compact notation for functions applied to collections of input points: x1:k indicates the sequence x1, ..., xk; f(x1:k) = [f(x1), ..., f(xk)]; µ0(x1:k) = [µ0(x1), ..., µ0(xk)]; and Σ0(x1:k, x1:k) =
[Σ0(x1, x1), ..., Σ0(x1, xk); ...; Σ0(xk, x1), ..., Σ0(xk, xk)].
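To make the prior in (2) concrete, the sketch below evaluates µ0(x1:k) and Σ0(x1:k, x1:k) at a grid of points and draws sample paths from the resulting multivariate normal. The zero mean function, squared-exponential kernel, and hyperparameter values are illustrative assumptions, not choices made in the text:

```python
import numpy as np

def mu0(x):
    return 0.0  # assumed constant-zero prior mean

def kernel(xi, xj, alpha0=1.0, alpha=1.0):
    # Assumed squared-exponential kernel with illustrative hyperparameters.
    return alpha0 * np.exp(-alpha * (xi - xj) ** 2)

# Evaluate the prior (2) at k points x_{1:k} on a 1-d grid.
k = 50
x = np.linspace(0.0, 5.0, k)
mean = np.array([mu0(xi) for xi in x])                  # mu0(x_{1:k})
cov = np.array([[kernel(a, b) for b in x] for a in x])  # Sigma0(x_{1:k}, x_{1:k})

# A tiny "jitter" on the diagonal keeps the matrix numerically PSD.
cov += 1e-10 * np.eye(k)

# f(x_{1:k}) ~ Normal(mean, cov): draw three sample paths from the prior.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=3)  # shape (3, k)
```

Each row of `samples` is one plausible function under the prior: smooth because nearby grid points are strongly correlated under the kernel.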
Suppose we observe f(x1:n) without noise for some n, and we wish to infer the value of f(x) at some new point x. To do so, we let k = n + 1 and xk = x, so that the prior over [f(x1:n), f(x)] is given by (2). We may then compute the conditional distribution of f(x) given these observations using Bayes' rule (see details in Chapter 2.1 of Rasmussen and Williams (2006)),