Ensemble clustering: As discussed in the previous chapter, the different models for clustering may produce clusters that are very different from one another. Which of these clusterings is the best solution? Often, there is no single answer to this question. Rather, the knowledge from multiple models may be combined to gain a more unified insight from the clustering process. Ensemble clustering can be viewed as a meta-algorithm that is used to gain more significant insights from multiple models.
This chapter is organized as follows: Section 7.2 discusses algorithms for clustering categorical data. Scalable clustering algorithms are discussed in Sect. 7.3. High-dimensional algorithms are addressed in Sect. 7.4. Semisupervised clustering algorithms are discussed in Sect. 7.5. Interactive and visual clustering algorithms are discussed in Sect. 7.6. Ensemble clustering methods are presented in Sect. 7.7. Section 7.8 discusses the different applications of data clustering. Section 7.9 provides a summary.
7.2 Clustering Categorical Data
The problem of categorical (or discrete) data clustering is challenging because most of the primitive operations in data clustering, such as distance computation, representative determination, and density estimation, are naturally designed for numeric data. A salient observation is that categorical data can always be converted to binary data with the use of the binarization process discussed in Chap. 2. It is often easier to work with binary data because it is also a special case of numeric data. However, in such cases, the algorithms need to be tailored to binary data.

Table 7.1: Example of a 2-dimensional categorical data cluster

Data | (Color, Shape)
1    | (Blue, Square)
2    | (Red, Circle)
3    | (Green, Cube)
4    | (Blue, Cube)
5    | (Green, Square)
6    | (Red, Circle)
7    | (Blue, Square)
8    | (Green, Cube)
9    | (Blue, Circle)
10   | (Green, Cube)

Table 7.2: Mean histogram and modes for categorical data cluster

Attribute | Histogram                               | Mode
Color     | Blue = 0.4, Green = 0.4, Red = 0.2      | Blue or Green
Shape     | Cube = 0.4, Square = 0.3, Circle = 0.3  | Cube
This chapter will discuss a wide variety of algorithms for clustering categorical data. The specific challenges associated with applying the various classical methods to categorical data will be addressed in detail along with the required modifications.
7.2.1 Representative-Based Algorithms
The centroid-based representative algorithms, such as k-means, require the repeated determination of centroids of clusters, and the determination of similarity between the centroids and the original data points. As discussed in Sect. 6.3 of the previous chapter, these algorithms iteratively determine the centroids of clusters, and then assign data points to their closest centroid. At a higher level, these steps remain the same for categorical data. However, the specifics of both steps are affected by the categorical data representation as follows:
Centroid of a categorical data set: All representative-based algorithms require the determination of a central representative of a set of objects. In the case of numerical data, this is achieved very naturally by averaging. However, for categorical data, the equivalent centroid is a probability histogram of values on each attribute. For each attribute i and possible value vj, the histogram value pij represents the fraction of objects in the cluster for which attribute i takes on the value vj. Therefore, for a d-dimensional data set, the centroid of a cluster of points is a set of d different histograms, representing the probability distribution of categorical values of each attribute in the cluster. If ni is the number of distinct values of attribute i, then such an approach will require O(ni) space to represent the centroid of the ith attribute. A cluster of 2-dimensional data points with attributes Color and Shape is illustrated in Table 7.1. The corresponding histograms for the Color and Shape attributes are illustrated in Table 7.2. Note that the probability values over a particular attribute always sum to 1.
Calculating similarity to centroids: A variety of similarity functions between a pair of categorical records are introduced in Sect. 3.2.2 of Chap. 3. The simplest of these is match-based similarity. However, in this case, the goal is to determine the similarity
between a probability histogram (corresponding to a representative) and a categorical attribute value. If the attribute i takes on the value vj for a particular data record, then the analogous match-based similarity is its histogram-based probability pij. These probabilities are summed up over the different attributes to determine the total similarity. Each data record is assigned to the centroid with the greatest similarity, as in the sketch following this list.
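A minimal sketch of these two steps in Python, assuming that each record is a tuple of categorical attribute values; the function names and the data layout are illustrative rather than standard, and the toy data is the cluster of Table 7.1, whose centroid reproduces the histograms of Table 7.2.

from collections import Counter

def histogram_centroid(cluster):
    # One value-probability histogram per attribute (the categorical centroid).
    d = len(cluster[0])
    centroid = []
    for i in range(d):
        counts = Counter(record[i] for record in cluster)
        total = sum(counts.values())
        centroid.append({value: count / total for value, count in counts.items()})
    return centroid

def match_based_similarity(record, centroid):
    # Sum, over all attributes, the histogram probability of the record's value.
    return sum(centroid[i].get(value, 0.0) for i, value in enumerate(record))

def assign_to_closest(records, centroids):
    # Assign each record to the index of the centroid with the greatest similarity.
    return [max(range(len(centroids)),
                key=lambda c: match_based_similarity(r, centroids[c]))
            for r in records]

cluster = [("Blue", "Square"), ("Red", "Circle"), ("Green", "Cube"),
           ("Blue", "Cube"), ("Green", "Square"), ("Red", "Circle"),
           ("Blue", "Square"), ("Green", "Cube"), ("Blue", "Circle"),
           ("Green", "Cube")]
print(histogram_centroid(cluster))
# [{'Blue': 0.4, 'Red': 0.2, 'Green': 0.4}, {'Square': 0.3, 'Circle': 0.3, 'Cube': 0.4}]

The remaining iterations simply alternate between recomputing the histograms of each cluster and reassigning records with assign_to_closest, exactly as in the numeric k-means algorithm.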
The other steps of the k-means algorithm remain the same as for the case of numeric data. The effectiveness of the k-means algorithm is highly dependent on the distribution of the attribute values in the underlying data. For example, if the attribute values are highly skewed, as in the case of market basket data, the histogram-based variation of the match-based measure may perform poorly. This is because this measure treats all attribute values evenly; however, rare attribute values should be treated with greater importance in such cases. This can be achieved by a preprocessing phase that assigns a weight to each categorical attribute value, which is the inverse of its global frequency. Therefore, the categorical data records now have weights associated with each attribute value. The presence of these weights will affect both probability histogram generation and match-based similarity computation.
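The weighting step described above can be sketched as follows; the weight of each categorical value is taken here as the reciprocal of its global count, one reasonable reading of "inverse of its global frequency", and all names are illustrative.

from collections import Counter

def inverse_frequency_weights(records):
    # weights[i][v] = 1 / (number of records in which attribute i takes the value v)
    d = len(records[0])
    weights = []
    for i in range(d):
        counts = Counter(record[i] for record in records)
        weights.append({value: 1.0 / count for value, count in counts.items()})
    return weights

records = [("Blue", "Square"), ("Red", "Circle"), ("Green", "Cube"),
           ("Blue", "Cube"), ("Green", "Square")]
weights = inverse_frequency_weights(records)
# weights[0] == {'Blue': 0.5, 'Red': 1.0, 'Green': 0.5}

These weights can then be applied both when the probability histograms are accumulated and when the match-based similarity is computed, so that rare attribute values contribute more heavily.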
7.2.1.1 k-Modes Clustering
In k-modes clustering, each attribute value for a representative is chosen as the mode of the categorical values for that attribute in the cluster. The mode of a set of categorical values is the value with the maximum frequency in the set. The modes of each attribute for the cluster of ten points in Table 7.1 are illustrated in Table 7.2. Intuitively, this corresponds to the categorical value vj for each attribute i for which the frequency histogram has the largest value of pij. The mode of an attribute may not be unique if two categorical values have the same frequency. In the case of Table 7.2, two possible values of the mode are (Blue, Cube) and (Green, Cube). Either of these could be used as the representative if a random tie-breaking criterion is used. The mode-based representative may not be drawn from the original data set because the mode of each attribute is determined independently. Therefore, the particular combination of d-dimensional modes obtained for the representative may not belong to the original data. One advantage of the mode-based approach is that the representative is also a categorical data record, rather than a histogram. Therefore, it is easier to use a richer set of similarity functions for computing the distances between data points and their modes. For example, the inverse occurrence frequency-based similarity function, described in Chap. 3, may be used to normalize for the skew in the attribute values. On the other hand, when the attribute values in a categorical data set are naturally skewed, as in market basket data, the use of modes may not be informative. For example, for a market basket data set, all item attributes for the representative point may be set to the value of 0 because of the natural sparsity of the data set. Nevertheless, for cases where the attribute values are more evenly distributed, the k-modes approach can be used effectively. One way of making the k-modes algorithm work well in cases where the attribute values are distributed unevenly is by dividing the cluster-specific frequency of an attribute value by its (global) occurrence frequency to determine a normalized frequency. This essentially corrects for the differential global distribution of different attribute values. The modes of this normalized frequency are used. The most commonly used similarity function is the match-based similarity metric, discussed in Sect. 3.2.2 of Chap. 3. However, for biased categorical data distributions, the inverse occurrence frequency should be used for normalizing the similarity function, as discussed in Chap. 3. This can be achieved indirectly by weighting each attribute of
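A minimal sketch of computing a mode-based representative, assuming the same tuple representation of records as in the earlier sketch; mode_representative and the optional global_counts argument (which implements the normalized-frequency variant described above) are illustrative names, not standard functions.

from collections import Counter

def mode_representative(cluster, global_counts=None):
    # Pick, independently for each attribute, the most frequent (or normalized
    # most frequent) categorical value in the cluster.
    d = len(cluster[0])
    representative = []
    for i in range(d):
        counts = Counter(record[i] for record in cluster)
        if global_counts is not None:
            # Divide the cluster-specific frequency by the global occurrence
            # frequency to correct for globally common attribute values.
            counts = {v: c / global_counts[i][v] for v, c in counts.items()}
        # Ties are broken arbitrarily; the resulting d-dimensional combination
        # of modes need not occur in the original data set.
        representative.append(max(counts, key=counts.get))
    return tuple(representative)

cluster = [("Blue", "Square"), ("Red", "Circle"), ("Green", "Cube"),
           ("Blue", "Cube"), ("Green", "Square"), ("Red", "Circle"),
           ("Blue", "Square"), ("Green", "Cube"), ("Blue", "Circle"),
           ("Green", "Cube")]
print(mode_representative(cluster))  # ('Blue', 'Cube'); ('Green', 'Cube') is an equally valid tie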