Data Mining: The Textbook




Ensemble clustering: As discussed in the previous chapter, the different models for clustering may produce clusters that are very different from one another. Which of these clusterings is the best solution? Often, there is no single answer to this question. Rather, the knowledge from multiple models may be combined to gain a more unified insight from the clustering process. Ensemble clustering can be viewed as a meta-algorithm that is used to gain more significant insights from multiple models.

This chapter is organized as follows: Section 7.2 discusses algorithms for clustering categorical data. Scalable clustering algorithms are discussed in Sect. 7.3. High-dimensional algorithms are addressed in Sect. 7.4. Semisupervised clustering algorithms are discussed in Sect. 7.5. Interactive and visual clustering algorithms are discussed in Sect. 7.6. Ensemble clustering methods are presented in Sect. 7.7. Section 7.8 discusses the different applications of data clustering. Section 7.9 provides a summary.


7.2 Clustering Categorical Data


The problem of categorical (or discrete) data clustering is challenging because most of the primitive operations in data clustering, such as distance computation, representative determination, and density estimation, are naturally designed for numeric data. A salient observation is that categorical data can always be converted to binary data with the use of the binarization process discussed in Chap. 2. It is often easier to work with binary data because it is also a special case of numeric data. However, in such cases, the algorithms need to be tailored to binary data. A small sketch of this binarization, applied to the records of Table 7.1, is shown after the tables below.

Table 7.1: Example of a 2-dimensional categorical data cluster

    Data    (Color, Shape)
    1       (Blue, Square)
    2       (Red, Circle)
    3       (Green, Cube)
    4       (Blue, Cube)
    5       (Green, Square)
    6       (Red, Circle)
    7       (Blue, Square)
    8       (Green, Cube)
    9       (Blue, Circle)
    10      (Green, Cube)

Table 7.2: Mean histogram and modes for categorical data cluster

    Attribute    Histogram       Mode
    Color        Blue = 0.4      Blue or Green
                 Green = 0.4
                 Red = 0.2
    Shape        Cube = 0.4      Cube
                 Square = 0.3
                 Circle = 0.3
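As a purely illustrative sketch (not the book's notation, and only a stand-in for the full treatment in Chap. 2), the following Python fragment shows how a few records of Table 7.1 could be one-hot binarized; the names records, domains, and binarize are hypothetical.

```python
# A minimal, purely illustrative one-hot binarization of a few records from
# Table 7.1; each (attribute, value) combination becomes its own 0/1 attribute.
records = [("Blue", "Square"), ("Red", "Circle"), ("Green", "Cube")]

# Distinct values of each attribute, in a fixed order.
domains = [sorted({r[i] for r in records}) for i in range(len(records[0]))]

def binarize(record):
    """Return the binary (one-hot) encoding of a categorical record."""
    return [1 if record[i] == v else 0
            for i in range(len(record)) for v in domains[i]]

print(binarize(("Blue", "Square")))  # [1, 0, 0, 0, 0, 1]
```

After such a transformation, each record is a binary vector, so numeric techniques become applicable, although, as noted above, they typically still need to be tailored to the sparse 0/1 structure.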


This section will discuss a wide variety of algorithms for clustering categorical data. The specific challenges associated with applying the various classical methods to categorical data will be addressed in detail along with the required modifications.


7.2.1 Representative-Based Algorithms


The centroid-based representative algorithms, such as k-means, require the repeated determination of centroids of clusters, and the determination of similarity between the centroids and the original data points. As discussed in Sect. 6.3 of the previous chapter, these algorithms iteratively determine the centroids of clusters, and then assign data points to their closest centroid. At a higher level, these steps remain the same for categorical data. However, the specifics of both steps are affected by the categorical data representation as follows:





  1. Centroid of a categorical data set: All representative-based algorithms require the determination of a central representative of a set of objects. In the case of numerical data, this is achieved very naturally by averaging. However, for categorical data, the equivalent centroid is a probability histogram of values on each attribute. For each attribute i and possible value v_j, the histogram value p_ij represents the fraction of the objects in the cluster for which attribute i takes on the value v_j. Therefore, for a d-dimensional data set, the centroid of a cluster of points is a set of d different histograms, representing the probability distribution of categorical values of each attribute in the cluster. If n_i is the number of distinct values of attribute i, then such an approach will require O(n_i) space to represent the centroid of the ith attribute. A cluster of 2-dimensional data points with attributes Color and Shape is illustrated in Table 7.1. The corresponding histograms for the Color and Shape attributes are illustrated in Table 7.2. Note that the probability values over a particular attribute always sum to 1.




  2. Calculating similarity to centroids: A variety of similarity functions between a pair of categorical records are introduced in Sect. 3.2.2 of Chap. 3. The simplest of these is match-based similarity. However, in this case, the goal is to determine the similarity between a probability histogram (corresponding to a representative) and a categorical attribute value. If attribute i takes on the value v_j in a particular data record, then the analogous match-based similarity is its histogram-based probability p_ij. These probabilities are summed up over the different attributes to determine the total similarity. Each data record is assigned to the centroid with the greatest similarity. A small illustrative sketch of both steps follows this list.
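The following minimal Python sketch illustrates the two steps above on the cluster of Table 7.1; it is only an assumed illustration, and names such as histogram_centroid and match_similarity are hypothetical rather than taken from the text.

```python
from collections import Counter

# Cluster of categorical records from Table 7.1 (attributes: Color, Shape).
cluster = [
    ("Blue", "Square"), ("Red", "Circle"), ("Green", "Cube"),
    ("Blue", "Cube"), ("Green", "Square"), ("Red", "Circle"),
    ("Blue", "Square"), ("Green", "Cube"), ("Blue", "Circle"),
    ("Green", "Cube"),
]

def histogram_centroid(points):
    """One frequency histogram per attribute: entry p_ij is the fraction of
    points in which attribute i takes on value v_j (cf. Table 7.2)."""
    d, n = len(points[0]), len(points)
    return [{v: c / n for v, c in Counter(p[i] for p in points).items()}
            for i in range(d)]

def match_similarity(record, centroid):
    """Match-based similarity of a record to a histogram centroid: the sum,
    over all attributes, of the histogram probability of the record's value."""
    return sum(centroid[i].get(v, 0.0) for i, v in enumerate(record))

centroid = histogram_centroid(cluster)
print(centroid[0])                                   # {'Blue': 0.4, 'Red': 0.2, 'Green': 0.4}
print(match_similarity(("Blue", "Cube"), centroid))  # 0.4 + 0.4 = 0.8
```

In an iterative representative-based loop, each record would be compared to every cluster centroid in this way, assigned to the centroid with the largest similarity, and the histograms would then be recomputed.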


The other steps of the k-means algorithm remain the same as for the case of numeric data. The effectiveness of a k-means algorithm is highly dependent on the distribution of the attribute values in the underlying data. For example, if the attribute values are highly skewed, as in the case of market basket data, the histogram-based variation of the match-based measure may perform poorly. This is because this measure treats all attribute values evenly, whereas rare attribute values should be treated with greater importance in such cases. This can be achieved by a preprocessing phase that assigns a weight to each categorical attribute value, which is the inverse of its global frequency. Therefore, the categorical data records now have weights associated with each attribute. The presence of these weights will affect both probability histogram generation and match-based similarity computation.
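One simple way to realize this weighting is sketched below; it is a minimal sketch under the assumption that the weight of each value is the inverse of its global relative frequency, and the function names inverse_frequency_weights and weighted_match_similarity are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(data):
    """One weight dictionary per attribute: the weight of a value is the
    inverse of its global relative frequency, so rare values count for more."""
    d, n = len(data[0]), len(data)
    return [{v: n / c for v, c in Counter(rec[i] for rec in data).items()}
            for i in range(d)]

def weighted_match_similarity(record, centroid, weights):
    """Match-based similarity in which the histogram probability of each
    matched value is scaled by that value's inverse-frequency weight.
    Values never seen in the global data contribute zero."""
    return sum(weights[i].get(v, 0.0) * centroid[i].get(v, 0.0)
               for i, v in enumerate(record))
```

The same weights could equally be applied when the probability histograms themselves are constructed.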


7.2.1.1 k-Modes Clustering

In k-modes clustering, each attribute value for a representative is chosen as the mode of the categorical values for that attribute in the cluster. The mode of a set of categorical values is the value with the maximum frequency in the set. The modes of each attribute for the cluster of ten points in Table 7.1 are illustrated in Table 7.2. Intuitively, this corresponds to the categorical value v_j, for each attribute i, for which the frequency histogram has the largest value of p_ij. The mode of an attribute may not be unique if two categorical values have the same frequency. In the case of Table 7.2, two possible values of the mode are (Blue, Cube) and (Green, Cube). Either of these could be used as the representative if a random tie-breaking criterion is used. The mode-based representative may not be drawn from the original data set because the mode of each attribute is determined independently. Therefore, the particular combination of d-dimensional modes obtained for the representative may not belong to the original data.

One advantage of the mode-based approach is that the representative is also a categorical data record, rather than a histogram. Therefore, it is easier to use a richer set of similarity functions for computing the distances between data points and their modes. For example, the inverse occurrence frequency-based similarity function, described in Chap. 3, may be used to normalize for the skew in the attribute values. On the other hand, when the attribute values in a categorical data set are naturally skewed, as in market basket data, the use of modes may not be informative. For example, for a market basket data set, all item attributes of the representative point may be set to the value of 0 because of the natural sparsity of the data set. Nevertheless, for cases where the attribute values are more evenly distributed, the k-modes approach can be used effectively.

One way of making the k-modes algorithm work well in cases where the attribute values are distributed unevenly is to divide the cluster-specific frequency of an attribute value by its (global) occurrence frequency to determine a normalized frequency. This essentially corrects for the differential global distribution of different attribute values, and the modes of these normalized frequencies are used. The most commonly used similarity function is the match-based similarity metric, discussed in Sect. 3.2.2 of Chap. 3. However, for biased categorical data distributions, the inverse occurrence frequency should be used for normalizing the similarity function, as discussed in Chap. 3. This can be achieved indirectly by weighting each attribute value with the inverse of its occurrence frequency, as discussed above.
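A minimal sketch of the mode computation, including the optional normalization by global frequency described above, might look as follows; it is an illustration rather than a prescribed implementation, and the function name modes is hypothetical.

```python
from collections import Counter

# Cluster from Table 7.1 (attributes: Color, Shape).
cluster = [
    ("Blue", "Square"), ("Red", "Circle"), ("Green", "Cube"),
    ("Blue", "Cube"), ("Green", "Square"), ("Red", "Circle"),
    ("Blue", "Square"), ("Green", "Cube"), ("Blue", "Circle"),
    ("Green", "Cube"),
]

def modes(points, global_counts=None):
    """Mode-based representative: for each attribute, pick the value with the
    largest within-cluster frequency. If global_counts (one Counter per
    attribute over the full data set) is given, each cluster frequency is first
    divided by the corresponding global frequency, which down-weights values
    that are globally common."""
    representative = []
    for i in range(len(points[0])):
        counts = Counter(p[i] for p in points)
        if global_counts is not None:
            scores = {v: c / global_counts[i][v] for v, c in counts.items()}
        else:
            scores = dict(counts)
        # Ties (here Blue vs. Green on Color) are broken by iteration order;
        # a random tie-break could be used instead.
        representative.append(max(scores, key=scores.get))
    return tuple(representative)

print(modes(cluster))  # ('Blue', 'Cube') with this simple tie-breaking
```

Note that the resulting representative, such as (Blue, Cube) here, need not coincide with any record of the original cluster, as discussed above.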



