Data Mining: The Textbook




f^{j,Θ}(Xi) = (1 / (σ · √(2 · π))^d) · e^(−||Xi − Yj||² / (2σ²))                (6.17)

This model assumes that all mixture components have the same radius σ, and the cluster in each component is spherical. Note that the exponent in the distribution is the scaled square of the Euclidean distance. (The maximum-likelihood estimates in the M-step are obtained by setting the partial derivative of L(D|M) (see Eq. 6.12) with respect to each parameter in μi and Σ to 0.) How do the E-step and M-step compare to the assignment and re-centering steps of the k-means algorithm?





1. (E-step) Each data point i has a probability of belonging to cluster j, which is proportional to the scaled and exponentiated Euclidean distance to each representative Yj. In the k-means algorithm, this is done in a hard way, by picking the best Euclidean distance to any representative Yj.




2. (M-step) The center Yj is the weighted mean over all the data points, where the weight is defined by the probability of assignment to cluster j. The hard version of this is used in k-means, where each data point is either assigned to a cluster or not assigned to a cluster (i.e., analogous to 0-1 probabilities). Both steps are illustrated in the sketch below.
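To make this correspondence concrete, the following is a minimal Python/NumPy sketch (illustrative code, not from the text) of one EM iteration for the spherical Gaussian model of Eq. 6.17. The function names e_step and m_step, the fixed value of σ, and the toy data are assumptions made only for the illustration.

import numpy as np

def e_step(X, centers, sigma):
    # Soft assignments: P(cluster j | point i) is proportional to
    # exp(-||Xi - Yj||^2 / (2 sigma^2)), the exponent of Eq. 6.17.
    # Equal priors (alpha_j = 1/k) and a shared sigma are assumed, so the
    # normalization constants cancel out of the ratio.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # shape (n, k)
    log_w = -sq_dist / (2.0 * sigma ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)        # each row sums to 1

def m_step(X, resp):
    # Each center Yj becomes the mean of all points, weighted by the soft
    # assignment probabilities; k-means uses 0-1 weights instead.
    return (resp.T @ X) / resp.sum(axis=0)[:, None]

# One soft iteration on toy two-cluster data (k = 2, sigma = 1.0).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers = X[rng.choice(len(X), size=2, replace=False)]
resp = e_step(X, centers, sigma=1.0)
centers = m_step(X, resp)

Replacing the soft responsibilities with a hard 0-1 assignment to the nearest center turns these two steps into exactly the assignment and re-centering steps of k-means.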

When the mixture distribution is defined with more general forms of the Gaussian distribution, the corresponding k-representative algorithm is the Mahalanobis k-means algorithm. It is noteworthy that the exponent of the general Gaussian distribution is the Mahalanobis distance. This implies that special cases of the EM algorithm are equivalent to a soft version of the k-means algorithm, where the exponentiated k-representative distances are used to define soft EM assignment probabilities.


The E-step is structurally similar to the Assign step, and the M-step is similar to the Optimize step in k-representative algorithms. Many mixture component distributions can be expressed in the form K1 · e^(−K2·Dist(Xi, Yj)), where K1 and K2 are regulated by the distribution parameters. The log-likelihood of such an exponentiated distribution directly maps to an additive distance term Dist(Xi, Yj) in the M-step objective function, which is structurally identical to the corresponding additive optimization term in k-representative methods. For many EM models with mixture probability distributions of the form K1 · e^(−K2·Dist(Xi, Yj)), a corresponding k-representative algorithm can be defined with the distance function Dist(Xi, Yj).
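As a worked instance of this mapping (read off directly from Eq. 6.17, not an additional result), the spherical Gaussian corresponds to K1 = 1/(σ · √(2 · π))^d, K2 = 1/(2σ²), and Dist(Xi, Yj) = ||Xi − Yj||². Taking logarithms gives log f^{j,Θ}(Xi) = log K1 − K2 · ||Xi − Yj||², so each data point contributes an additive squared Euclidean distance term to the log-likelihood, which is precisely the form of the k-means objective.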


Practical Considerations


The major practical consideration in mixture modeling is the desired level of flexibility in defining the mixture components. For example, when each mixture component is defined as a generalized Gaussian, the model is more effective at finding clusters of arbitrary shape and orientation. On the other hand, this requires the learning of a larger number of parameters, such as a d × d covariance matrix Σj. When the amount of available data is small, such an approach does not work very well because of overfitting. Overfitting refers to the situation where the parameters learned from a small sample of the true generative model are not reflective of this model because of the noisy variations within the data. Furthermore, as in the k-means algorithm, the EM algorithm can converge to a local optimum.
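To make the overfitting argument concrete with a rough parameter count (an illustrative calculation, not from the text): a symmetric d × d covariance matrix Σj has d(d + 1)/2 free entries, so a mixture of k generalized Gaussians learns about k · (d + d(d + 1)/2) + (k − 1) parameters for the means, covariances, and prior probabilities. With d = 10 and k = 5 this is already 5 · (10 + 55) + 4 = 329 parameters, whereas the shared-radius spherical model discussed next learns only the single shape parameter σ beyond the cluster means.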


At the other extreme, one can pick a spherical Gaussian in which every mixture component has the same radius σ, and also fix the a priori generative probability αi to 1/k. In this case, the EM model works quite effectively even on very small data sets, because only a single parameter needs to be learned by the algorithm. However, if the different clusters have different shapes, sizes, and orientations, the clustering will be poor even on a large data set. The general rule of thumb is to tailor the model complexity to the available data size: larger data sets allow more complex models. In some cases, an analyst may have domain knowledge about the distribution of data points in clusters. In these scenarios, the best option is to select the mixture components on the basis of this domain knowledge.




Figure 6.11: Clusters of arbitrary shape and grid partitions of different granularity


6.6 Grid-Based and Density-Based Algorithms


One of the major problems with distance-based and probabilistic methods is that the shape of the underlying clusters is already defined implicitly by the underlying distance function or probability distribution. For example, a k-means algorithm implicitly assumes a spherical shape for the cluster. Similarly, an EM algorithm with the generalized Gaussian assumes elliptical clusters. In practice, the clusters may be hard to model with a prototypical shape implied by a distance function or probability distribution. To understand this point, consider the clusters illustrated in Fig. 6.11a. It is evident that there are two clusters of sinusoidal shape in the data. However, virtually any choice of representatives in a k-means algorithm will result in the representatives of one of the clusters pulling away data points from the other.
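The failure mode of Fig. 6.11a is easy to reproduce. The following sketch (illustrative code, assuming NumPy and scikit-learn are available; the data-generation choices are arbitrary) builds two noisy sinusoidal strips and applies 2-means to them:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
x = rng.uniform(0, 4 * np.pi, 500)
# Two parallel sinusoidal strips, offset vertically, with mild noise.
upper = np.column_stack([x, np.sin(x) + 2 + rng.normal(0, 0.1, 500)])
lower = np.column_stack([x, np.sin(x) - 2 + rng.normal(0, 0.1, 500)])
X = np.vstack([upper, lower])
true_labels = np.array([0] * 500 + [1] * 500)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for c in (0, 1):
    # Count how many points of each true strip fall in k-means cluster c.
    print(c, np.bincount(true_labels[pred == c], minlength=2))

Because k-means prefers compact, roughly spherical partitions, the lowest-cost two-way split typically cuts across both strips rather than separating them, so each reported cluster usually mixes points from the two true clusters.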


Density-based algorithms are very helpful in such scenarios. The core idea in such algorithms is to first identify fine-grained dense regions in the data. These form the "building blocks" for constructing the arbitrarily shaped clusters. They can also be considered pseudo-data points that need to be carefully re-clustered into groups of arbitrary shape. Thus, most density-based methods can be considered two-level hierarchical algorithms. Because there are fewer building blocks in the second phase than data points in the first phase, it is possible to organize them into complex shapes using more detailed analysis. This detailed analysis (or postprocessing) phase is conceptually similar to a single-linkage agglomerative algorithm, which is usually better suited to discovering clusters of arbitrary shape.
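A minimal sketch of this two-level idea (an illustrative simplification, not one of the specific algorithms discussed later): partition two-dimensional data with a uniform grid, keep cells containing at least a minimum number of points as the dense building blocks, and then merge adjacent dense cells into connected components. The function name grid_density_cluster and its parameters are hypothetical.

import numpy as np
from collections import defaultdict

def grid_density_cluster(X, cell_width, min_points):
    # Phase 1: map each 2-d point to a grid cell and keep the dense cells
    # (the fine-grained "building blocks").
    cells = defaultdict(list)
    for idx, point in enumerate(X):
        cells[tuple((point // cell_width).astype(int))].append(idx)
    dense = {c for c, members in cells.items() if len(members) >= min_points}

    # Phase 2: merge dense cells that touch (share an edge or corner) into
    # connected components -- a single-linkage-style agglomeration of the
    # building blocks that can trace out arbitrarily shaped clusters.
    cluster_of, num_clusters = {}, 0
    for cell in dense:
        if cell in cluster_of:
            continue
        cluster_of[cell] = num_clusters
        stack = [cell]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cluster_of:
                        cluster_of[nb] = num_clusters
                        stack.append(nb)
        num_clusters += 1

    # Assign each point the label of its cell; points in sparse cells are noise (-1).
    labels = np.full(len(X), -1)
    for cell, members in cells.items():
        if cell in cluster_of:
            labels[members] = cluster_of[cell]
    return labels

The choice of cell_width controls the granularity trade-off suggested by Fig. 6.11: cells that are too coarse merge distinct clusters, while cells that are too fine fragment a cluster into disconnected pieces.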



