Data Mining: The Textbook





Algorithm GenericTopDownClustering(Data: D, Flat Algorithm: A)
begin
  Initialize tree T to root containing D;
  repeat
    Select a leaf node L in T based on pre-defined criterion;
    Use algorithm A to split L into L1 . . . Lk;
    Add L1 . . . Lk as children of L in T;
  until termination criterion;
end

Figure 6.10: Generic top-down meta-algorithm for clustering

The algorithm recursively splits nodes in top-down fashion until either a certain height of the tree is reached or each node contains fewer than a predefined number of data objects. A wide variety of algorithms can be designed with different instantiations of the algorithm A and different growth strategies. Note that the algorithm A can be an arbitrary clustering algorithm, not just a distance-based one.
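The following is a minimal Python sketch of this meta-algorithm, assuming scikit-learn's KMeans as the flat algorithm A. The names TreeNode and top_down_clustering are illustrative; the growth strategy here splits the largest splittable leaf, and the termination criterion caps the number of leaves.

import numpy as np
from sklearn.cluster import KMeans  # stands in for the flat algorithm A

class TreeNode:
    # A node of the cluster tree; each node holds the points assigned to it.
    def __init__(self, points):
        self.points = points
        self.children = []

def top_down_clustering(D, k=2, min_size=10, max_leaves=8):
    root = TreeNode(D)
    leaves = [root]
    while len(leaves) < max_leaves:
        # Pre-defined criterion: pick the largest leaf that can still be split.
        candidates = [L for L in leaves if len(L.points) >= min_size]
        if not candidates:
            break  # termination criterion
        L = max(candidates, key=lambda n: len(n.points))
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(L.points)
        for c in range(k):  # add L1 ... Lk as children of L
            L.children.append(TreeNode(L.points[labels == c]))
        leaves.remove(L)
        leaves.extend(L.children)
    return root

tree = top_down_clustering(np.random.rand(200, 2))  # example usage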


6.4.2.1 Bisecting k-Means


The bisecting k-means algorithm is a top-down hierarchical clustering algorithm in which each node is split into exactly two children with a 2-means algorithm. To split a node into two children, several randomized trial runs of the split are executed, and the split that yields the best overall clustering objective is retained. Variants of this approach differ in the growth strategy used to select the node to be split. For example, the heaviest node (the one containing the most points) may be split first, or the node with the smallest distance from the root may be split first. These choices balance either the cluster weights or the tree height, respectively.
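A sketch of the trial-split step follows, using the within-cluster sum of squared errors (SSE) as the clustering objective; the helper name best_bisection is ours, not a standard routine.

import numpy as np
from sklearn.cluster import KMeans

def best_bisection(points, trials=5):
    # Run several randomized 2-means splits and keep the one with lowest SSE.
    best_labels, best_sse = None, np.inf
    for seed in range(trials):
        km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(points)
        if km.inertia_ < best_sse:  # inertia_ is the SSE of this split
            best_labels, best_sse = km.labels_, km.inertia_
    return points[best_labels == 0], points[best_labels == 1]

left, right = best_bisection(np.random.rand(100, 2))  # example usage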


6.5 Probabilistic Model-Based Algorithms


Most clustering algorithms discussed in this book are hard clustering algorithms, in which each data point is deterministically assigned to a particular cluster. Probabilistic model-based algorithms are soft algorithms, in which each data point may have a nonzero assignment probability to many (typically all) clusters. A soft solution to a clustering problem may be converted to a hard one by assigning each data point to the cluster for which it has the largest assignment probability.
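For example, hardening a soft solution reduces to a per-point argmax; the matrix P below is purely illustrative.

import numpy as np

P = np.array([[0.7, 0.2, 0.1],   # assignment probabilities of point 0
              [0.1, 0.3, 0.6]])  # assignment probabilities of point 1
hard = P.argmax(axis=1)  # each point goes to its most probable cluster
print(hard)  # [0 2]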


The broad principle of a mixture-based generative model is to assume that the data was generated from a mixture of k distributions G1 . . . Gk. Each distribution Gi represents a cluster and is also referred to as a mixture component. Each data point Xi, where i ∈ {1 . . . n}, is generated by this mixture model as follows (a small sampling sketch appears after the two steps):





  1. Select a mixture component with prior probability αi = P (Gi), where i ∈ {1 . . . k}. Assume that the rth one is selected.




  2. Generate a data point from Gr.
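The two-step process is easy to simulate. The sketch below assumes three 1-dimensional Gaussian mixture components with illustrative parameters; all names are ours.

import numpy as np

rng = np.random.default_rng(0)
alpha = [0.5, 0.3, 0.2]   # prior probabilities alpha_i = P(G_i)
means = [0.0, 5.0, 10.0]  # illustrative component means
sds = [1.0, 0.5, 2.0]     # illustrative component standard deviations

def sample_point():
    r = rng.choice(len(alpha), p=alpha)  # step 1: select component G_r
    return rng.normal(means[r], sds[r])  # step 2: generate a point from G_r

data = np.array([sample_point() for _ in range(1000)])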

This generative model will be denoted by M. The different prior probabilities αi and the parameters of the different distributions Gr are not known in advance. Each distribution Gi is often assumed to be Gaussian, although any arbitrary (and different) family of distributions may be assumed for each Gi. The choice of distribution Gi is important because it reflects the user's a priori understanding about the distribution and shape of the individual clusters (mixture components). The parameters of the distribution of each mixture component, such as its mean and variance, need to be estimated from the data, so that the overall data has the maximum likelihood of being generated by the model. This is achieved with the expectation-maximization (EM) algorithm. The parameters of the different mixture components can be used to describe the clusters. For example, estimating the mean of each Gaussian component is analogous to determining the mean of each cluster in a k-representative algorithm. After the parameters of the mixture components have been estimated, the posterior generative (or assignment) probabilities of data points with respect to each mixture component (cluster) can be determined.
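As a simplified illustration of this estimation process (not the general algorithm), the following is a minimal EM sketch for a 1-dimensional Gaussian mixture; the initialization choices are our own.

import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k=2, iters=50):
    rng = np.random.default_rng(0)
    alpha = np.full(k, 1.0 / k)                # start with uniform priors
    mu = rng.choice(x, size=k, replace=False)  # initialize means from data
    sd = np.full(k, x.std())
    for _ in range(iters):
        # E-step: posterior P(G_i | x_j) is proportional to alpha_i * f^i(x_j)
        resp = alpha * norm.pdf(x[:, None], mu, sd)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood re-estimates of the parameters
        n_i = resp.sum(axis=0)
        alpha = n_i / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_i
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_i)
    return alpha, mu, sd

x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 200)])
alpha_hat, mu_hat, sd_hat = em_gmm_1d(x)  # example usage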


Assume that the probability density function of mixture component Gi is denoted by f^i(·). The probability (density function) of the data point Xj being generated by the model is given by the weighted sum of the probability densities over the different mixture components, where the weight is the prior probability αi = P(Gi) of each mixture component:

f^point(Xj | M) = Σ_{i=1}^{k} αi · f^i(Xj)
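Under the same illustrative 1-dimensional Gaussian assumption as above, this density can be evaluated directly:

import numpy as np
from scipy.stats import norm

alpha = np.array([0.5, 0.3, 0.2])    # illustrative priors
means = np.array([0.0, 5.0, 10.0])   # illustrative component means
sds = np.array([1.0, 0.5, 2.0])      # illustrative standard deviations

def mixture_density(x):
    # f^point(x | M) = sum_i alpha_i * f^i(x) for Gaussian components
    return float((alpha * norm.pdf(x, means, sds)).sum())

print(mixture_density(4.5))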



