f(x) \mid f(x_{1:n}) \sim \mathrm{Normal}\left(\mu_n(x), \sigma_n^2(x)\right)

\mu_n(x) = \Sigma_0(x, x_{1:n}) \Sigma_0(x_{1:n}, x_{1:n})^{-1} \left(f(x_{1:n}) - \mu_0(x_{1:n})\right) + \mu_0(x)

\sigma_n^2(x) = \Sigma_0(x, x) - \Sigma_0(x, x_{1:n}) \Sigma_0(x_{1:n}, x_{1:n})^{-1} \Sigma_0(x_{1:n}, x).        (3)

This conditional distribution is called the posterior probability distribution in the nomenclature of Bayesian statistics.






Figure 2: Random functions f drawn from a Gaussian process prior with a power exponential kernel. Each plot corresponds to a different value of the parameter α_1, with α_1 decreasing from left to right. Varying this parameter creates different beliefs about how quickly f(x) changes with x.



The posterior mean µ_n(x) is a weighted average between the prior µ_0(x) and an estimate based on the data f(x_{1:n}), with a weight that depends on the kernel. The posterior variance σ_n^2(x) is equal to the prior covariance Σ_0(x, x) less a term that corresponds to the variance removed by observing f(x_{1:n}).
Rather than computing posterior means and variances directly using (3) and matrix inversion, it is typically faster and more numerically stable to use a Cholesky decomposition and then solve a linear system of equations. This more sophisticated technique is discussed as Algorithm 2.1 in Section 2.2 of Rasmussen and Williams (2006). Additionally, to improve the numerical stability of this approach or of direct computation using (3), it is often useful to add a small positive number such as 10^{-6} to each element of the diagonal of Σ_0(x_{1:n}, x_{1:n}), especially when x_{1:n} contains two or more points that are close together. This prevents eigenvalues of Σ_0(x_{1:n}, x_{1:n}) from being too close to 0, and changes the predictions that would be made by an infinite-precision computation only by a small amount.
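As a concrete illustration, the following is a minimal sketch in Python (using NumPy and SciPy) of this Cholesky-based computation with a 10^{-6} jitter term; the `kernel` and `mu0` callables and the argument names are placeholders standing in for a chosen covariance kernel and mean function, not anything prescribed here:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(x, X, fX, kernel, mu0, jitter=1e-6):
    """Posterior mean and variance of f(x) given observations fX at the rows of X."""
    # Prior covariance among the observed points, with a small jitter term on
    # the diagonal to keep the matrix well conditioned.
    K = kernel(X, X) + jitter * np.eye(len(X))
    k = kernel(x[None, :], X).ravel()            # Sigma_0(x, x_{1:n})
    L = cho_factor(K, lower=True)                # Cholesky factorization of K
    alpha = cho_solve(L, fX - mu0(X))            # K^{-1} (f(x_{1:n}) - mu_0(x_{1:n}))
    mean = mu0(x[None, :])[0] + k @ alpha        # posterior mean mu_n(x)
    var = kernel(x[None, :], x[None, :])[0, 0] - k @ cho_solve(L, k)  # sigma_n^2(x)
    return mean, var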
Although we have modeled f at only a finite number of points, the same approach can be used when modeling f over a continuous domain A. Formally a Gaussian process with mean function µ0 and kernel Σ0 is a probability distribution over the function f with the property that, for any given collection of points x1:k, the marginal probability distribution on f (x1:k) is given by (2). Moreover, the arguments that justified (3) still hold when our prior probability distribution on f is a Gaussian process.
In addition to calculating the conditional distribution of f (x) given f (x1:n), it is also possible to calculate the conditional distribution of f at more than one unevaluated point. The resulting distribution is multivariate normal, with a mean vector and covariance kernel that depend on the location of the unevaluated points, the locations of the measured points x1:n, and their measured values f (x1:n). The functions that give entries in this mean vector and covariance matrix have the form required for a mean function and kernel described above, and the conditional distribution of f given f (x1:n) is a Gaussian process with this mean function and covariance kernel.
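Continuing the hypothetical sketch above, the joint posterior over a set of unevaluated points Xstar is obtained the same way, now returning a mean vector and a full covariance matrix rather than pointwise quantities:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior_joint(Xstar, X, fX, kernel, mu0, jitter=1e-6):
    # Joint posterior over several unevaluated points: a multivariate normal
    # whose mean vector and covariance matrix follow the same pattern as (3).
    K = kernel(X, X) + jitter * np.eye(len(X))
    Ks = kernel(Xstar, X)                        # Sigma_0(Xstar, x_{1:n})
    Kss = kernel(Xstar, Xstar)                   # Sigma_0(Xstar, Xstar)
    L = cho_factor(K, lower=True)
    mean = mu0(Xstar) + Ks @ cho_solve(L, fX - mu0(X))
    cov = Kss - Ks @ cho_solve(L, Ks.T)
    return mean, cov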



    3.1. Choosing a Mean Function and Kernel


We now discuss the choice of kernel. Kernels typically have the property that points closer in the input space are more strongly correlated, i.e., that if ||x − x′|| < ||x − x′′|| for some norm || · ||, then Σ_0(x, x′) > Σ_0(x, x′′). Additionally, kernels are required to be positive semi-definite functions. Here we describe two example kernels and how they are used.
One commonly used and simple kernel is the power exponential or Gaussian kernel,

\Sigma_0(x, x') = \alpha_0 \exp\left(-\|x - x'\|^2\right),

where ||x − x′||^2 = \sum_{i=1}^{d} α_i (x_i − x′_i)^2, and α_{0:d} are parameters of the kernel. Figure 2 shows random functions with a 1-dimensional input drawn from a Gaussian process prior with a power exponential kernel with different values of α_1. Varying this parameter creates different beliefs about how quickly f(x) changes with x.
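A minimal sketch of this kernel in Python (the function and parameter names are illustrative; `alpha0` holds α_0 and `alpha` holds α_1, ..., α_d):

import numpy as np

def power_exponential_kernel(X1, X2, alpha0, alpha):
    # ||x - x'||^2 with per-dimension weights alpha_i, as in the definition above.
    diff = X1[:, None, :] - X2[None, :, :]       # shape (n1, n2, d)
    sqdist = np.sum(alpha * diff ** 2, axis=-1)
    return alpha0 * np.exp(-sqdist)              # Sigma_0(x, x')

This produces the matrix of kernel values between two sets of points, which is the shape assumed by the `kernel` argument in the earlier posterior sketches.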

Another commonly used kernel is the Matérn kernel,

\Sigma_0(x, x') = \alpha_0 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\|x - x'\|\right)^{\nu} K_{\nu}\left(\sqrt{2\nu}\,\|x - x'\|\right),

where K_ν is the modified Bessel function, and we have a parameter ν in addition to the parameters α_{0:d}. We discuss choosing these parameters below in Section 3.2.
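A sketch of this kernel, again with illustrative names, using SciPy's modified Bessel function of the second kind and the same weighted norm as above:

import numpy as np
from scipy.special import gamma, kv

def matern_kernel(X1, X2, alpha0, alpha, nu):
    # Weighted distance ||x - x'||, using the same per-dimension weights alpha_i
    # as in the power exponential kernel.
    diff = X1[:, None, :] - X2[None, :, :]
    r = np.sqrt(np.sum(alpha * diff ** 2, axis=-1))
    r = np.maximum(r, 1e-12)                     # avoid 0 * inf at r = 0; the limit there is alpha0
    z = np.sqrt(2.0 * nu) * r
    return alpha0 * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)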

Perhaps the most common choice for the mean function is a constant value, µ_0(x) = µ. When f is believed to have a trend or some application-specific parametric structure, we may also take the mean function to be

\mu_0(x) = \mu + \sum_{i=1}^{p} \beta_i \Psi_i(x),        (4)

where each Ψ_i is a parametric function, and often a low-order polynomial in x.
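A sketch of such a mean function (names illustrative), with the basis functions Ψ_i supplied as callables:

import numpy as np

def parametric_mean(X, mu, beta, basis_funcs):
    # mu_0(x) = mu + sum_i beta_i * Psi_i(x); each Psi_i maps the rows of X to
    # one value per row, e.g. a low-order polynomial in x.
    return mu + sum(b * psi(X) for b, psi in zip(beta, basis_funcs))

# Example: a constant mean plus a linear trend in the first input dimension.
# mean_values = parametric_mean(X, mu=0.0, beta=[0.5], basis_funcs=[lambda X: X[:, 0]])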


    3.2. Choosing Hyperparameters


The mean function and kernel contain parameters. We typically call these parameters of the prior hyperparameters, and indicate them via a vector η. For example, if we use a Matérn kernel and a constant mean function, η = (α_{0:d}, ν, µ).
To choose the hyperparameters, three approaches are typically considered. The first is to find the maximum likelihood estimate (MLE). In this approach, when given observations f (x1:n), we calculate the likelihood of these observations under the prior, P (f (x1:n)|η), where we modify our notation to indicate its dependence on η. This likelihood is a multivariate normal density. Then, in maximum likelihood estimation, we set η to the value that maximizes this likelihood,



\hat{\eta} = \operatorname{argmax}_{\eta} P(f(x_{1:n}) \mid \eta).
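As an illustration, here is a sketch of MLE for the power exponential kernel above with a constant mean, so that η = (α_{0:d}, µ); the α parameters are optimized on a log scale to keep them positive, and the data arrays X, fX and the input dimension d are assumed given:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def neg_log_likelihood(theta, X, fX):
    # theta = (log alpha_0, log alpha_1, ..., log alpha_d, mu).
    alpha0, alpha, mu = np.exp(theta[0]), np.exp(theta[1:-1]), theta[-1]
    K = power_exponential_kernel(X, X, alpha0, alpha) + 1e-6 * np.eye(len(X))
    # P(f(x_{1:n}) | eta) is a multivariate normal density.
    return -multivariate_normal.logpdf(fX, mean=mu * np.ones(len(X)), cov=K)

# eta_hat = minimize(neg_log_likelihood, x0=np.zeros(d + 2), args=(X, fX)).x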
The second approach amends this first approach by imagining that the hyperparameters η were themselves chosen from a prior, P (η). We then estimate η by the maximum a posteriori (MAP) estimate (Gelman et al., 2014), which is the value of η that maximizes the posterior,


\hat{\eta} = \operatorname{argmax}_{\eta} P(\eta \mid f(x_{1:n})) = \operatorname{argmax}_{\eta} P(f(x_{1:n}) \mid \eta)\, P(\eta).

In moving from the first expression to the second we have used Bayes' rule and then dropped a normalization constant, ∫ P(f(x_{1:n})|η) P(η) dη, that does not depend on the quantity η being optimized.
The MLE is a special case of the MAP if we take the prior on the hyperparameters P (η) to be the
(possibly degenerate) probability distribution that has constant density over the domain of η. The MAP is useful if the MLE sometimes estimates unreasonable hyperparameter values, for example, corresponding to functions that vary too quickly or too slowly (see Figure 2). By choosing a prior that puts more weight on hyperparameter values that are reasonable for a particular problem, MAP estimates can better correspond to the application. Common choices for the prior include the uniform distribution (for preventing estimates from falling outside of some pre-specified range), the normal distribution (for suggesting that the estimates fall near some nominal value without setting a hard cutoff), and the log-normal and truncated normal distributions (for providing a similar suggestion for positive parameters).
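A possible MAP sketch extends the hypothetical MLE objective above by subtracting a log-prior; the particular priors here, a normal prior on µ and normal priors on log α_i (i.e., log-normal priors on the α_i), are purely illustrative:

import numpy as np
from scipy.stats import norm

def neg_log_posterior(theta, X, fX):
    # MAP objective: negative log-likelihood minus the log of the prior P(eta).
    nll = neg_log_likelihood(theta, X, fX)
    log_prior = norm.logpdf(theta[-1], loc=0.0, scale=10.0)            # normal prior on mu
    log_prior += norm.logpdf(theta[:-1], loc=0.0, scale=1.0).sum()     # normal priors on log alphas
    return nll - log_prior

# eta_hat_map = minimize(neg_log_posterior, x0=np.zeros(d + 2), args=(X, fX)).x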
The third approach is called the fully Bayesian approach. In this approach, we wish to compute the posterior distribution on f (x) marginalizing over all possible values of the hyperparameters,

P(f(x) = y \mid f(x_{1:n})) = \int P(f(x) = y \mid f(x_{1:n}), \eta)\, P(\eta \mid f(x_{1:n}))\, d\eta.        (5)

This integral is typically intractable, but we can approximate it through sampling:

P(f(x) = y \mid f(x_{1:n})) \approx \frac{1}{J} \sum_{j=1}^{J} P\left(f(x) = y \mid f(x_{1:n}), \eta = \hat{\eta}_j\right),        (6)



where (η̂_j : j = 1, ..., J) are sampled from P(η | f(x_{1:n})) via an MCMC method, e.g., slice sampling (Neal, 2003). MAP estimation can be seen as an approximation to fully Bayesian inference: if we approximate the posterior P(η | f(x_{1:n})) by a point mass at the η that maximizes the posterior density, then (5) reduces to inference that simply plugs in the MAP estimate.
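A sketch of the averaging step in (6), assuming the posterior and kernel sketches above and hyperparameter samples η̂_j (produced, for example, by a slice sampler) supplied as a list of (α_0, α_{1:d}, µ) tuples:

import numpy as np

def fully_bayesian_posterior(x, X, fX, eta_samples):
    # Approximate (6): average the Gaussian posteriors at x over hyperparameter
    # samples eta_j drawn from P(eta | f(x_{1:n})) by MCMC.
    means, variances = [], []
    for alpha0, alpha, mu in eta_samples:
        kernel = lambda A, B, a0=alpha0, a=alpha: power_exponential_kernel(A, B, a0, a)
        mu0 = lambda A, m=mu: m * np.ones(len(A))
        m_j, v_j = gp_posterior(x, X, fX, kernel, mu0)
        means.append(m_j)
        variances.append(v_j)
    # The result is an equally weighted mixture of normals; its predictive mean
    # is the average of the component means.
    return np.mean(means), means, variances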

yükləyin