6.5. PROBABILISTIC MODEL-BASED ALGORITHMS
corresponding to random assignments of data points to mixture components), and proceeds as follows:
(E-step) Given the current value of the parameters in Θ, estimate the posterior probability P(Gi|Xj, Θ) of the component Gi having been selected in the generative process, given that we have observed data point Xj. The quantity P(Gi|Xj, Θ) is also the soft cluster assignment probability that we are trying to estimate. This step is executed for each data point Xj and mixture component Gi.
(M-step) Given the current probabilities of assignments of data points to clusters, use the maximum likelihood approach to determine the values of all the parameters in Θ that maximize the log-likelihood fit on the basis of current assignments.
The two steps are executed repeatedly in order to improve the maximum likelihood criterion. The algorithm is said to converge when the objective function does not improve significantly in a certain number of iterations. The details of the E-step and the M-step will now be explained.
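To make this control flow concrete, the following Python sketch shows the alternation of the two steps together with the convergence test on the log-likelihood objective. It is illustrative only: the functions e_step, m_step, and log_likelihood are hypothetical placeholders for the computations detailed in the remainder of this section, and theta is assumed to bundle all model parameters (mixing weights, means, covariances).

```python
import numpy as np

def em_loop(X, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=300):
    """Generic EM skeleton: alternate the E-step and the M-step until the
    log-likelihood criterion stops improving significantly."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        posteriors = e_step(X, theta)      # soft assignments P(Gi | Xj, Theta)
        theta = m_step(X, posteriors)      # maximum-likelihood update of Theta
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:             # no significant improvement: converged
            break
        prev_ll = ll
    return theta, e_step(X, theta)         # final E-step yields the soft assignments
```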
The E-step uses the currently available model parameters to compute the probability density of the data point Xj being generated by each component of the mixture. This probability density is used to compute the Bayes probability that the data point Xj was generated by component Gi (with model parameters fixed to the current set of the parameters Θ):
$$
P(G_i \mid X_j, \Theta) \;=\; \frac{P(G_i)\cdot P(X_j \mid G_i,\Theta)}{\sum_{r=1}^{k} P(G_r)\cdot P(X_j \mid G_r,\Theta)} \;=\; \frac{\alpha_i \cdot f^{\,i,\Theta}(X_j)}{\sum_{r=1}^{k} \alpha_r \cdot f^{\,r,\Theta}(X_j)}. \qquad (6.13)
$$
As you will learn in Chap. 10 on classification, Eq. 6.13 is exactly the mechanism with which a Bayes classifier assigns previously unseen data points to categories (classes). A superscript Θ has been added to the probability density functions to denote the fact that they are evaluated for current model parameters Θ.
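As a concrete illustration of Eq. 6.13 for a Gaussian mixture, the sketch below computes the full matrix of posterior probabilities with NumPy and SciPy. The parameter names (alphas, means, covs) are chosen for this example and are not part of the text's notation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, means, covs):
    """Compute P(Gi | Xj, Theta) for every data point Xj and component Gi
    via Bayes' rule (Eq. 6.13)."""
    n, k = X.shape[0], len(alphas)
    weighted = np.empty((n, k))
    for i in range(k):
        # Numerator of Eq. 6.13: alpha_i * f^{i,Theta}(Xj) for all points Xj
        weighted[:, i] = alphas[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
    # Denominator of Eq. 6.13: sum over all components r = 1..k
    return weighted / weighted.sum(axis=1, keepdims=True)
```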
The M-step requires the optimization of the parameters for each probability distribution under the assumption that the E-step has provided the “correct” soft assignment. To optimize the fit, the partial derivative of the log-likelihood fit with respect to the corresponding model parameters needs to be computed and set to zero. Without going through the details of these algebraic steps, the values of the model parameters resulting from this optimization are described here.
The value of each αi is estimated as the current weighted fraction of points assigned to cluster i, where a weight of P (Gi|Xj , Θ) is associated with data point Xj . Therefore, we have:
$$
\alpha_i = P(G_i) = \frac{\sum_{j=1}^{n} P(G_i \mid X_j, \Theta)}{n}. \qquad (6.14)
$$
In practice, in order to obtain more robust results for smaller data sets, the expected number of data points belonging to each cluster in the numerator is augmented by 1, and the total number of points in the denominator is n + k. Therefore, the estimated value is as follows:
$$
\alpha_i = \frac{1 + \sum_{j=1}^{n} P(G_i \mid X_j, \Theta)}{k + n}. \qquad (6.15)
$$
This approach is also referred to as Laplacian smoothing.
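Given the responsibility matrix produced by the E-step, the smoothed estimate of Eq. 6.15 is a one-line computation. A minimal sketch, reusing the hypothetical naming of the earlier fragments:

```python
import numpy as np

def update_mixing_weights(posteriors):
    """Eq. 6.15: Laplacian-smoothed estimate of alpha_i, where posteriors has
    shape (n, k) and holds P(Gi | Xj, Theta) from the E-step."""
    n, k = posteriors.shape
    return (1.0 + posteriors.sum(axis=0)) / (k + n)
```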
To determine the other parameters for component i, the value of P (Gi|Xj , Θ) is treated as a weight of that data point. Consider a Gaussian mixture model in d dimensions, in which the distribution of the ith component is defined as follows:
$$
f^{\,i,\Theta}(X_j) = \frac{1}{\sqrt{|\Sigma_i|}\,(2\pi)^{d/2}}\; e^{-\frac{1}{2}(X_j - \mu_i)\,\Sigma_i^{-1}\,(X_j - \mu_i)^T}. \qquad (6.16)
$$
Here, μi is the d-dimensional mean vector of the ith Gaussian component, and Σi is the d × d covariance matrix of the generalized Gaussian distribution of the ith component. The notation |Σi| denotes the determinant of the covariance matrix. It can be shown that the maximum-likelihood estimates of μi and Σi are the (probabilistically weighted) mean and covariance matrix of the data points in that component, where the probabilistic weights are the assignment probabilities computed in the E-step. Interestingly, this is exactly how the representatives and covariance matrices of the Mahalanobis k-means approach are derived in Sect. 6.3. The only difference is that the data points were not weighted there, because the deterministic k-means algorithm uses hard assignments. Note that the term in the exponent of the Gaussian distribution is the square of the Mahalanobis distance.
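The weighted means and covariance matrices can be written directly as weighted averages over the responsibility matrix. A minimal sketch, under the same assumptions and naming as the earlier fragments:

```python
import numpy as np

def update_gaussian_parameters(X, posteriors):
    """M-step for a Gaussian mixture: probabilistically weighted means and
    covariance matrices, with weights P(Gi | Xj, Theta) from the E-step."""
    n, d = X.shape
    k = posteriors.shape[1]
    weights = posteriors.sum(axis=0)                  # expected number of points per cluster
    means = (posteriors.T @ X) / weights[:, None]     # weighted means, shape (k, d)
    covs = np.empty((k, d, d))
    for i in range(k):
        centered = X - means[i]                       # (Xj - mu_i) for all j
        covs[i] = (posteriors[:, i, None] * centered).T @ centered / weights[i]
    return means, covs
```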
The E-step and the M-step can be iteratively executed to convergence to determine the optimal parameter set Θ. At the end of the process, a probabilistic model is obtained that describes the entire data set in terms of a generative model. The model also provides soft assignment probabilities P (Gi|Xj , Θ) of the data points, on the basis of the final execution of the E-step.
In practice, to reduce the number of estimated parameters, the non-diagonal entries of Σi are often set to 0. In such cases, the determinant of Σi simplifies to the product of the variances along the individual dimensions, and the term in the exponent becomes a squared Euclidean distance in which each dimension is scaled by its variance. If all diagonal entries are further constrained to have the same value, then this is equivalent to using the squared Euclidean distance, and all components of the mixture will have spherical clusters. Thus, different choices and complexities of mixture model distributions provide different levels of flexibility in representing the probability distribution of each component.
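Restricting the covariance structure in this way amounts to projecting the full maximum-likelihood estimate onto the chosen simpler form. A brief sketch of this post-processing step, under the same assumptions as the previous fragments:

```python
import numpy as np

def constrain_covariance(cov, mode="diagonal"):
    """Simplify an estimated covariance matrix.
    'diagonal' keeps only the per-dimension variances; 'spherical' further
    forces a single shared variance, yielding spherical clusters."""
    if mode == "diagonal":
        return np.diag(np.diag(cov))
    if mode == "spherical":
        d = cov.shape[0]
        return np.eye(d) * np.trace(cov) / d
    return cov
```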
This two-phase iterative approach is similar to representative-based algorithms. The E-step can be viewed as a soft version of the assign step in distance-based partitioning algorithms. The M-step is reminiscent of the optimize step, in which optimal component-specific parameters are learned on the basis of the fixed assignment. The distance term in the exponent of the probability distribution provides the natural connection between probabilistic and distance-based algorithms. This connection is discussed in the next section.
6.5.1 Relationship of EM to k-means and Other Representative Methods
The EM algorithm provides an extremely flexible framework for probabilistic clustering, and certain special cases can be viewed as soft versions of distance-based clustering methods. As a specific example, consider the case where all a priori generative probabilities αi are fixed to 1/k as a part of the model setting. Furthermore, all components of the mixture have the same radius σ along all directions, and the mean of the jth cluster is assumed to be Yj. Thus, the only parameters to be learned are σ and Y1 . . . Yk. In that case, the jth component of the mixture has the following distribution:
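Under these assumptions, substituting Σj = σ²·I and μj = Yj into Eq. 6.16 yields the spherical Gaussian form below (written out here explicitly as a reconstruction from the stated assumptions):

$$
f^{\,j,\Theta}(X) = \frac{1}{\sigma^{d}\,(2\pi)^{d/2}}\; e^{-\frac{\lVert X - Y_j \rVert^{2}}{2\sigma^{2}}}.
$$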