Data Mining: The Textbook





Algorithm GenericTopDownClustering(Data: D, Flat Algorithm: A)
begin
  Initialize tree T to root containing D;
  repeat
    Select a leaf node L in T based on pre-defined criterion;
    Use algorithm A to split L into L1 . . . Lk;
    Add L1 . . . Lk as children of L in T;
  until termination criterion;
end

Figure 6.10: Generic top-down meta-algorithm for clustering

The algorithm recursively splits nodes in top-down fashion until either a certain height of the tree is reached or each node contains fewer than a predefined number of data objects. A wide variety of algorithms can be designed with different instantiations of the algorithm A and different growth strategies. Note that the algorithm A can be an arbitrary clustering algorithm, not just a distance-based one.
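The following is a minimal Python sketch of this meta-algorithm, assuming scikit-learn's KMeans as the flat algorithm A. The names TreeNode and top_down_clustering are illustrative; the growth strategy here splits the largest splittable leaf, and the termination criterion caps the number of leaves.

import numpy as np
from sklearn.cluster import KMeans  # stands in for the flat algorithm A

class TreeNode:
    # A node of the cluster tree; each node holds the points assigned to it.
    def __init__(self, points):
        self.points = points
        self.children = []

def top_down_clustering(D, k=2, min_size=10, max_leaves=8):
    root = TreeNode(D)
    leaves = [root]
    while len(leaves) < max_leaves:
        # Pre-defined criterion: pick the largest leaf that can still be split.
        candidates = [L for L in leaves if len(L.points) >= min_size]
        if not candidates:
            break  # termination criterion
        L = max(candidates, key=lambda n: len(n.points))
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(L.points)
        for c in range(k):  # add L1 ... Lk as children of L
            L.children.append(TreeNode(L.points[labels == c]))
        leaves.remove(L)
        leaves.extend(L.children)
    return root

tree = top_down_clustering(np.random.rand(200, 2))  # example usage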


6.4.2.1 Bisecting k-Means


The bisecting k-means algorithm is a top-down hierarchical clustering algorithm in which each node is split into exactly two children with a 2-means algorithm. To split a node into two children, several randomized trial runs of the split are executed, and the split that yields the best overall clustering objective is retained. Variants of this approach differ in the growth strategy used to select the node to be split. For example, the heaviest node (the one containing the most points) may be split first, or the node with the smallest distance from the root may be split first. These choices balance either the cluster weights or the tree height, respectively.
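A sketch of the trial-split step follows, using the within-cluster sum of squared errors (SSE) as the clustering objective; the helper name best_bisection is ours, not a standard routine.

import numpy as np
from sklearn.cluster import KMeans

def best_bisection(points, trials=5):
    # Run several randomized 2-means splits and keep the one with lowest SSE.
    best_labels, best_sse = None, np.inf
    for seed in range(trials):
        km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(points)
        if km.inertia_ < best_sse:  # inertia_ is the SSE of this split
            best_labels, best_sse = km.labels_, km.inertia_
    return points[best_labels == 0], points[best_labels == 1]

left, right = best_bisection(np.random.rand(100, 2))  # example usage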


6.5 Probabilistic Model-Based Algorithms


Most clustering algorithms discussed in this book are hard clustering algorithms, in which each data point is deterministically assigned to a particular cluster. Probabilistic model-based algorithms are soft algorithms, in which each data point may have a nonzero assignment probability to many (typically all) clusters. A soft solution to a clustering problem may be converted to a hard one by assigning each data point to the cluster for which it has the largest assignment probability.
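For example, hardening a soft solution reduces to a per-point argmax; the matrix P below is purely illustrative.

import numpy as np

P = np.array([[0.7, 0.2, 0.1],   # assignment probabilities of point 0
              [0.1, 0.3, 0.6]])  # assignment probabilities of point 1
hard = P.argmax(axis=1)  # each point goes to its most probable cluster
print(hard)  # [0 2]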


The broad principle of a mixture-based generative model is to assume that the data was generated from a mixture of k distributions G1 . . . Gk. Each distribution Gi represents a cluster and is also referred to as a mixture component. Each data point Xi, where i ∈ {1 . . . n}, is generated by this mixture model as follows (a small sampling sketch appears after the two steps):





  1. Select a mixture component with prior probability αi = P (Gi), where i ∈ {1 . . . k}. Assume that the rth one is selected.




  2. Generate a data point from Gr.
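The two-step process is easy to simulate. The sketch below assumes three 1-dimensional Gaussian mixture components with illustrative parameters; all names are ours.

import numpy as np

rng = np.random.default_rng(0)
alpha = [0.5, 0.3, 0.2]   # prior probabilities alpha_i = P(G_i)
means = [0.0, 5.0, 10.0]  # illustrative component means
sds = [1.0, 0.5, 2.0]     # illustrative component standard deviations

def sample_point():
    r = rng.choice(len(alpha), p=alpha)  # step 1: select component G_r
    return rng.normal(means[r], sds[r])  # step 2: generate a point from G_r

data = np.array([sample_point() for _ in range(1000)])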

This generative model will be denoted by M. The different prior probabilities αi and the parameters of the different distributions Gr are not known in advance. Each distribution Gi is often assumed to be Gaussian, although any arbitrary (and different) family of distributions may be assumed for each Gi. The choice of distribution Gi is important because it reflects the user's a priori understanding about the distribution and shape of the individual clusters (mixture components). The parameters of the distribution of each mixture component, such as its mean and variance, need to be estimated from the data, so that the overall data has the maximum likelihood of being generated by the model. This is achieved with the expectation-maximization (EM) algorithm. The parameters of the different mixture components can be used to describe the clusters. For example, estimating the mean of each Gaussian component is analogous to determining the mean of each cluster in a k-representative algorithm. After the parameters of the mixture components have been estimated, the posterior generative (or assignment) probabilities of data points with respect to each mixture component (cluster) can be determined.
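As a simplified illustration of this estimation process (not the general algorithm), the following is a minimal EM sketch for a 1-dimensional Gaussian mixture; the initialization choices are our own.

import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k=2, iters=50):
    rng = np.random.default_rng(0)
    alpha = np.full(k, 1.0 / k)                # start with uniform priors
    mu = rng.choice(x, size=k, replace=False)  # initialize means from data
    sd = np.full(k, x.std())
    for _ in range(iters):
        # E-step: posterior P(G_i | x_j) is proportional to alpha_i * f^i(x_j)
        resp = alpha * norm.pdf(x[:, None], mu, sd)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood re-estimates of the parameters
        n_i = resp.sum(axis=0)
        alpha = n_i / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_i
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_i)
    return alpha, mu, sd

x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 200)])
alpha_hat, mu_hat, sd_hat = em_gmm_1d(x)  # example usage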


Assume that the probability density function of mixture component Gi is denoted by f^i(·). The probability (density function) of the data point Xj being generated by the model is given by the weighted sum of the probability densities over the different mixture components, where the weight is the prior probability αi = P(Gi) of each mixture component:

f^point(Xj | M) = Σ_{i=1}^{k} αi · f^i(Xj)
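Under the same illustrative 1-dimensional Gaussian assumption as above, this density can be evaluated directly:

import numpy as np
from scipy.stats import norm

alpha = np.array([0.5, 0.3, 0.2])    # illustrative priors
means = np.array([0.0, 5.0, 10.0])   # illustrative component means
sds = np.array([1.0, 0.5, 2.0])      # illustrative standard deviations

def mixture_density(x):
    # f^point(x | M) = sum_i alpha_i * f^i(x) for Gaussian components
    return float((alpha * norm.pdf(x, means, sds)).sum())

print(mixture_density(4.5))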



