Data Mining: The Textbook




6.5 Probabilistic Model-Based Algorithms


The EM algorithm starts with an initial set of values for the parameters in Θ (typically corresponding to random assignments of data points to mixture components), and proceeds as follows:





  1. (E-step) Given the current value of the parameters in Θ, estimate the posterior probability P(Gi|Xj, Θ) of the component Gi having been selected in the generative process, given that we have observed data point Xj. The quantity P(Gi|Xj, Θ) is also the soft cluster assignment probability that we are trying to estimate. This step is executed for each data point Xj and mixture component Gi.

  2. (M-step) Given the current probabilities of assignments of data points to clusters, use the maximum likelihood approach to determine the values of all the parameters in Θ that maximize the log-likelihood fit on the basis of current assignments.

The two steps are executed repeatedly in order to improve the maximum likelihood criterion. The algorithm is said to converge when the objective function does not improve significantly in a certain number of iterations. The details of the E-step and the M-step will now be explained.
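The overall structure of this alternation can be sketched compactly in code. The following Python sketch assumes the data are held in an n × d NumPy array X; the functions e_step and m_step are illustrative placeholders for the two steps above and are sketched after Eqs. 6.13 and 6.16 below.

```python
import numpy as np

def em_cluster(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal EM loop sketch: alternate E- and M-steps until the
    log-likelihood improvement falls below a tolerance."""
    rng = np.random.default_rng(seed)
    # Initialization: random soft assignments of points to the k components.
    P = rng.dirichlet(np.ones(k), size=X.shape[0])      # n x k responsibilities
    theta = m_step(X, P)                                # parameters from the random assignments
    prev_ll = -np.inf
    for _ in range(max_iter):
        P, ll = e_step(X, theta)    # E-step: P(Gi | Xj, Theta) and current log-likelihood
        theta = m_step(X, P)        # M-step: maximum-likelihood parameters for fixed assignments
        if ll - prev_ll < tol:      # negligible improvement: declare convergence
            break
        prev_ll = ll
    return theta, P
```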


The E-step uses the currently available model parameters to compute the probability density of the data point Xj being generated by each component of the mixture. This probability density is used to compute the Bayes probability that the data point Xj was generated by component Gi (with the model parameters fixed to the current set Θ):

\[
P(G_i \mid X_j, \Theta) = \frac{P(G_i)\cdot P(X_j \mid G_i, \Theta)}{\sum_{r=1}^{k} P(G_r)\cdot P(X_j \mid G_r, \Theta)} = \frac{\alpha_i \cdot f^{\,i,\Theta}(X_j)}{\sum_{r=1}^{k} \alpha_r \cdot f^{\,r,\Theta}(X_j)} \qquad (6.13)
\]

As you will learn in Chap. 10 on classification, Eq. 6.13 is exactly the mechanism with which a Bayes classifier assigns previously unseen data points to categories (classes). A superscript Θ has been added to the probability density functions to denote the fact that they are evaluated for current model parameters Θ.
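As a concrete illustration of how Eq. 6.13 is evaluated for a Gaussian mixture, the sketch below computes each component density, multiplies it by the prior αi, and normalizes across components. The parameter container theta = (alpha, mu, sigma) and the function name are illustrative assumptions, not notation from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, theta):
    """E-step (Eq. 6.13): Bayes posterior of each mixture component for each point."""
    alpha, mu, sigma = theta            # mixture weights, means, covariance matrices
    n, k = X.shape[0], len(alpha)
    weighted = np.zeros((n, k))
    for i in range(k):
        # alpha_i * f^{i,Theta}(X_j), evaluated for every data point X_j at once
        weighted[:, i] = alpha[i] * multivariate_normal.pdf(X, mean=mu[i], cov=sigma[i])
    ll = np.log(weighted.sum(axis=1)).sum()             # log-likelihood of the current fit
    P = weighted / weighted.sum(axis=1, keepdims=True)  # denominator of Eq. 6.13
    return P, ll
```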

The M-step requires the optimization of the parameters for each probability distribution under the assumption that the E-step has provided the “correct” soft assignment. To optimize the fit, the partial derivative of the log-likelihood fit with respect to the corresponding model parameters needs to be computed and set to zero. Without specifically describing the details of these algebraic steps, the values of the model parameters that are computed as a result of the optimization are described here.


The value of each αi is estimated as the current weighted fraction of points assigned to cluster i, where a weight of P (Gi|Xj , Θ) is associated with data point Xj . Therefore, we have:




\[
\alpha_i = P(G_i) = \frac{\sum_{j=1}^{n} P(G_i \mid X_j, \Theta)}{n} \qquad (6.14)
\]

In practice, in order to obtain more robust results for smaller data sets, the expected number of data points belonging to each cluster in the numerator is augmented by 1, and the total number of points in the denominator is n + k. Therefore, the estimated value is as follows:

\[
\alpha_i = \frac{1 + \sum_{j=1}^{n} P(G_i \mid X_j, \Theta)}{k + n} \qquad (6.15)
\]

This approach is also referred to as Laplacian smoothing.
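For concreteness, the smoothed estimate of Eq. 6.15 amounts to a one-line computation on the matrix of E-step probabilities; in the sketch below, P is assumed to be an n × k NumPy array whose entry (j, i) holds P(Gi|Xj, Θ).

```python
import numpy as np

def update_alpha(P, smoothing=True):
    """Mixture weights: the weighted fraction of points per cluster (Eq. 6.14),
    with optional Laplacian smoothing (Eq. 6.15) for small data sets."""
    n, k = P.shape
    counts = P.sum(axis=0)               # expected number of points in each cluster
    if smoothing:
        return (1.0 + counts) / (k + n)  # Eq. 6.15
    return counts / n                    # Eq. 6.14
```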


To determine the other parameters for component i, the value of P (Gi|Xj , Θ) is treated as a weight of that data point. Consider a Gaussian mixture model in d dimensions, in which the distribution of the ith component is defined as follows:





\[
f^{\,i,\Theta}(X_j) = \frac{1}{\sqrt{|\Sigma_i|}\,(2\pi)^{d/2}}\; e^{-\frac{1}{2}\,(X_j - \mu_i)\,\Sigma_i^{-1}\,(X_j - \mu_i)^{T}} \qquad (6.16)
\]

Here, μi is the d-dimensional mean vector of the ith Gaussian component, and Σi is the d × d covariance matrix of the generalized Gaussian distribution of the ith component. The notation |Σi| denotes the determinant of the covariance matrix. It can be shown3 that the maximum-likelihood estimation of μi and Σi yields the (probabilistically weighted) means and covariance matrix of the data points in that component. These probabilistic weights were derived from the assignment probabilities in the E-step. Interestingly, this is exactly how the representatives and covariance matrices of the Mahalanobis k-means approach are derived in Sect. 6.3. The only difference was that the data points were not weighted because hard assignments were used by the deterministic k-means algorithm. Note that the term in the exponent of the Gaussian distribution is the square of the Mahalanobis distance.
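A minimal sketch of the corresponding M-step updates for the Gaussian case is given below. The responsibility matrix P holds the E-step probabilities P(Gi|Xj, Θ); the small ridge term added to each covariance matrix is only a numerical-stability safeguard and is not part of the maximum-likelihood derivation.

```python
import numpy as np

def m_step(X, P, smoothing=True, ridge=1e-6):
    """M-step for a Gaussian mixture: probabilistically weighted means and
    covariance matrices, plus the smoothed mixture weights of Eq. 6.15."""
    n, d = X.shape
    k = P.shape[1]
    counts = P.sum(axis=0)                              # soft cluster sizes
    alpha = (1.0 + counts) / (k + n) if smoothing else counts / n
    mu = (P.T @ X) / counts[:, None]                    # weighted means
    sigma = np.zeros((k, d, d))
    for i in range(k):
        diff = X - mu[i]                                # deviations from the i-th mean
        sigma[i] = (P[:, i, None] * diff).T @ diff / counts[i]   # weighted covariance
        sigma[i] += ridge * np.eye(d)                   # keep the matrix invertible
    return alpha, mu, sigma
```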


The E-step and the M-step can be iteratively executed to convergence to determine the optimal parameter set Θ. At the end of the process, a probabilistic model is obtained that describes the entire data set in terms of a generative model. The model also provides soft assignment probabilities P (Gi|Xj , Θ) of the data points, on the basis of the final execution of the E-step.


In practice, to minimize the number of estimated parameters, the non-diagonal entries of Σi are often set to 0. In such cases, the determinant of Σi simplifies to the product of the variances along the individual dimensions. This is equivalent to using the square of the Minkowski distance in the exponent. If all diagonal entries are further constrained to have the same value, then it is equivalent to using the Euclidean distance, and all components of the mixture will have spherical clusters. Thus, different choices and complexities of mixture model distributions provide different levels of flexibility in representing the probability distribution of each component.
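Under this diagonal constraint, the covariance update reduces to per-dimension weighted variances; a brief sketch under that assumption (the function name and array shapes are illustrative):

```python
import numpy as np

def diagonal_variances(X, P, mu):
    """Constrained M-step: per-dimension weighted variances for each component.
    Returns a k x d array; the implied covariance of component i is diag(var[i]),
    whose determinant is simply var[i].prod()."""
    counts = P.sum(axis=0)                              # soft cluster sizes
    k, d = P.shape[1], X.shape[1]
    var = np.zeros((k, d))
    for i in range(k):
        diff = X - mu[i]
        var[i] = (P[:, i, None] * diff ** 2).sum(axis=0) / counts[i]
    return var
```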


This two-phase iterative approach is similar to representative-based algorithms. The E-step can be viewed as a soft version of the assign step in distance-based partitioning algorithms. The M-step is reminiscent of the optimize step, in which optimal component-specific parameters are learned on the basis of the fixed assignment. The distance term in the exponent of the probability distribution provides the natural connection between probabilistic and distance-based algorithms. This connection is discussed in the next section.


6.5.1 Relationship of EM to k-means and Other Representative Methods


The EM algorithm provides an extremely flexible framework for probabilistic clustering, and certain special cases can be viewed as soft versions of distance-based clustering methods. As a specific example, consider the case where all a priori generative probabilities αi are fixed to 1/k as a part of the model setting. Furthermore, all components of the mixture have the same radius σ along all directions, and the mean of the jth cluster is assumed to be Yj. Thus, the only parameters to be learned are σ, and Y1 . . . Yk. In that case, the jth component of the mixture has the following distribution:

\[
f^{\,j,\Theta}(X_i) = \frac{1}{(\sigma\sqrt{2\pi})^{d}}\; e^{-\frac{\|X_i - Y_j\|^{2}}{2\sigma^{2}}} \qquad (6.17)
\]
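With these assumptions, the priors and the normalizing constant cancel in Eq. 6.13, so the E-step assigns each point to clusters by a softmax over scaled negative squared Euclidean distances to the means Y1 . . . Yk. A minimal sketch of this special-case E-step (names illustrative):

```python
import numpy as np

def soft_kmeans_e_step(X, Y, sigma):
    """Special-case E-step: with alpha_i = 1/k and a shared spherical variance,
    the posterior of Eq. 6.13 depends only on squared Euclidean distances."""
    # Squared distance from every data point to every cluster mean Y_j (n x k).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * sigma ** 2)              # exponent of the spherical Gaussian
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)        # soft assignment probabilities
```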

