Dist(Xi, Yj) = ||Xi − Yj||₂².    (6.5)
Here, || · ||p represents the Lp-norm. The expression Dist(Xi, Yj) can be viewed as the squared error of approximating a data point with its closest representative. Thus, the overall objective minimizes the sum of squared errors over different data points. This is also sometimes referred to as SSE. In such a case, it can be shown¹ that the optimal representative Yj for each of the "optimize" iterative steps is the mean of the data points in cluster Cj. Thus, the only difference between the generic pseudocode of Fig. 6.2 and a k-means pseudocode is the specific instantiation of the distance function Dist(·, ·), and the choice of the representative as the local mean of its cluster.
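As a concrete illustration, the following is a minimal sketch of this k-means instantiation in Python with NumPy. The function name, the random initialization from the data set, and the convergence test are illustrative choices rather than part of the formal description above.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Minimal sketch of k-means: assign each point to its closest representative
    # under the squared Euclidean distance, then reset each representative to the
    # mean of its cluster (the optimal "optimize" step, as shown in the footnote).
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=k, replace=False)].copy()   # initial representatives
    for _ in range(n_iter):
        # Assign step: squared L2 distances of all points to all representatives.
        dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Optimize step: the optimal representative of each cluster is its mean.
        new_Y = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                          for j in range(k)])
        if np.allclose(new_Y, Y):      # no representative moved: converged
            break
        Y = new_Y
    return labels, Y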

An interesting variation of the k-means algorithm is to use the local Mahalanobis distance for assignment of data points to clusters. This distance function is discussed in Sect. 3.2.1.6 of Chap. 3. Each cluster Cj has its own d × d covariance matrix Σj, which can be computed using the data points assigned to that cluster in the previous iteration. The squared Mahalanobis distance between data point Xi and representative Yj with covariance matrix Σj is defined as follows:

Dist(Xi, Yj) = (Xi − Yj) Σj⁻¹ (Xi − Yj)ᵀ.    (6.6)

¹For a fixed cluster assignment C1 . . . Ck, the gradient of the clustering objective function Σ_{j=1..k} Σ_{Xi ∈ Cj} ||Xi − Yj||₂² with respect to Yj is −2 Σ_{Xi ∈ Cj} (Xi − Yj). Setting the gradient to 0 yields the mean of cluster Cj as the optimum value of Yj. Note that the other clusters do not contribute to the gradient, and, therefore, the approach effectively optimizes the local clustering objective function for Cj.

Figure 6.4: Strengths and weaknesses of k-means
The use of the Mahalanobis distance is generally helpful when the clusters are elliptically elongated along certain directions, as in the case of Fig. 6.3. The factor Σj⁻¹ also provides local density normalization, which is helpful in data sets with varying local density. The resulting algorithm is referred to as the Mahalanobis k-means algorithm.
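A sketch of the modified assignment step is given below. It assumes that the cluster labels from the previous iteration are available, and it adds a small regularization term to each covariance matrix to keep it invertible; this regularization is an implementation detail not discussed in the text.

import numpy as np

def mahalanobis_assign(X, Y, prev_labels):
    # Assignment step of Mahalanobis k-means (Eq. 6.6): each cluster j has its own
    # covariance matrix, estimated from the points assigned to it previously.
    n, d = X.shape
    k = len(Y)
    dist = np.empty((n, k))
    for j in range(k):
        pts = X[prev_labels == j]
        if len(pts) < 2:                         # too few points to estimate a covariance
            cov = np.eye(d)
        else:
            cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(d)   # regularized for invertibility
        inv_cov = np.linalg.inv(cov)
        diff = X - Y[j]
        # Squared Mahalanobis distance (Xi - Yj) inv(Sigma_j) (Xi - Yj)^T for every point.
        dist[:, j] = np.einsum('ni,ij,nj->n', diff, inv_cov, diff)
    return dist.argmin(axis=1)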
The k-means algorithm does not work well when the clusters are of arbitrary shape. An example is illustrated in Fig. 6.4a, in which cluster A has a nonconvex shape. The k-means algorithm breaks it up into two parts, and also merges one of these parts with cluster B. Such situations are common in k-means, because it is biased toward finding spherical clusters. Even the Mahalanobis k-means algorithm does not work well in this scenario in spite of its ability to adjust for the elongation of clusters. On the other hand, the Mahalanobis k-means algorithm can adjust well to varying cluster density, as illustrated in Fig. 6.4b. This is because the Mahalanobis method normalizes local distances with the use of a cluster-specific covariance matrix. The data set of Fig. 6.4b cannot be effectively clustered by many density-based algorithms, which are designed to discover arbitrarily shaped clusters (cf. Sect. 6.6). Therefore, different algorithms are suitable in different application settings.

6.3.2 The Kernel k-Means Algorithm


The k-means algorithm can be extended to discovering clusters of arbitrary shape with the use of a method known as the kernel trick. The basic idea is to implicitly transform the data so that arbitrarily shaped clusters map to Euclidean clusters in the new space. Refer to Sect. 10.6.4.1 of Chap. 10 for a brief description of the kernel k-means algorithm. The main problem with the kernel k-means algorithm is that the complexity of computing the kernel matrix alone is quadratically related to the number of data points. Such an approach can effectively discover the arbitrarily shaped clusters of Fig. 6.4a.
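For concreteness, the following sketch shows how the assignment distances of kernel k-means can be computed purely from a precomputed kernel matrix K, which is exactly the quadratic-cost object mentioned above. The algebraic expansion of the feature-space distance is standard; the function and variable names are illustrative, and each cluster is assumed to be non-empty.

import numpy as np

def kernel_kmeans_distances(K, labels, k):
    # Squared distance of point i to the centroid of cluster C in the transformed space:
    # ||phi(x_i) - mu_C||^2 = K[i,i] - (2/|C|) * sum_{a in C} K[i,a]
    #                                + (1/|C|^2) * sum_{a,b in C} K[a,b]
    n = K.shape[0]
    dist = np.empty((n, k))
    for c in range(k):
        members = np.where(labels == c)[0]
        m = len(members)
        cross = K[:, members].sum(axis=1) / m                    # average kernel value to cluster c
        within = K[np.ix_(members, members)].sum() / (m * m)     # average intra-cluster kernel value
        dist[:, c] = np.diag(K) - 2.0 * cross + within
    return dist

Each iteration then reassigns points with dist.argmin(axis=1), exactly as in standard k-means, but without ever materializing the transformed coordinates.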




Algorithm GenericMedoids(Database: D, Number of Representatives: k)
begin
  Initialize representative set S by selecting from D;
  repeat
    Create clusters (C1 . . . Ck) by assigning each point in D to the
      closest representative in S using the distance function Dist(·, ·);
    Determine a pair Xi ∈ D and Yj ∈ S such that replacing Yj ∈ S with Xi
      leads to the greatest possible improvement in the objective function;
    Perform the exchange between Xi and Yj only if the improvement is positive;
  until no improvement in current iteration;
  return (C1 . . . Ck);
end

Figure 6.5: Generic k-medoids algorithm with unspecified hill-climbing strategy


6.3.3 The k-Medians Algorithm


In the k-medians algorithm, the Manhattan distance is used as the objective function of choice. Therefore, the distance function Dist(Xi, Yj ) is defined as follows:





Dist(Xi, Yj) = ||Xi − Yj||₁.    (6.7)
In such a case, it can be shown that the optimal representative Yj is the median of the data points along each dimension in cluster Cj. This is because the point that has the minimum sum of L1-distances to a set of points distributed on a line is the median of that set. The proof of this result is simple. The definition of a median can be used to show that a perturbation of ε in either direction from the median cannot strictly reduce the sum of L1-distances. This implies that the median optimizes the sum of the L1-distances to the data points in the set.
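The following tiny numerical check illustrates the property for an odd-sized set of points on a line; the grid search over candidate positions is purely for illustration.

import numpy as np

points = np.array([1.0, 2.0, 3.0, 10.0, 50.0])        # points on a line
candidates = np.linspace(0.0, 60.0, 6001)             # candidate representative positions
l1_cost = np.abs(points[None, :] - candidates[:, None]).sum(axis=1)
print(candidates[l1_cost.argmin()], np.median(points))   # both print 3.0: the median minimizes the L1 sum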


As the median is chosen independently along each dimension, the resulting d-dimensional representative will (typically) not belong to the original data set D. The k-medians approach is sometimes confused with the k-medoids approach, which chooses these representatives from the original database D. In this case, the only difference between the generic pseudocode of Fig. 6.2 and a k-medians variation would be to instantiate the distance function to the Manhattan distance and use the representative as the local median of the cluster (independently along each dimension). The k-medians approach generally selects cluster representatives in a more robust way than k-means, because the median is not as sensitive to the presence of outliers in the cluster as the mean.
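A sketch of the corresponding "optimize" step is shown below, under the assumption that the cluster labels from the assignment step are available; note that the dimension-wise median generally does not coincide with any point of the database, which is exactly the distinction from the k-medoids approach.

import numpy as np

def kmedians_update(X, labels, k, Y_old):
    # Optimize step of k-medians: the new representative of cluster j is the
    # dimension-wise median of its assigned points, which minimizes the sum of
    # Manhattan (L1) distances within the cluster.
    Y = Y_old.copy()
    for j in range(k):
        pts = X[labels == j]
        if len(pts) > 0:
            Y[j] = np.median(pts, axis=0)     # usually not an original data point
    return Y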


6.3.4 The k-Medoids Algorithm


Although the k-medoids algorithm also uses the notion of representatives, its algorithmic structure is different from the generic k-representatives algorithm of Fig. 6.2. The clustering objective function is, however, of the same form as the k-representatives algorithm. The main distinguishing feature of the k-medoids algorithm is that the representatives are always selected from the database D, and this difference necessitates changes to the basic structure of the k-representatives algorithm.


A question arises as to why it is sometimes desirable to select the representatives from D. There are two reasons for this. One reason is that the representative of a k-means cluster may be distorted by outliers in that cluster. In such cases, it is possible for the representative to be located in an empty region that is unrepresentative of most of the data points in that cluster. Such representatives may result in partial merging of different clusters, which is clearly undesirable. This problem can, however, be partially resolved with careful outlier handling and the use of outlier-robust variations such as the k-medians algorithm. The second reason is that it is sometimes difficult to compute the optimal central representative of a set of data points of a complex data type. For example, if the k-representatives clustering algorithm were to be applied to a set of time series of varying lengths, then how should the central representatives be defined as a function of these heterogeneous time series? In such cases, selecting representatives from the original data set may be very helpful. As long as a representative object is selected from each cluster, the approach will provide reasonably high-quality results. Therefore, a key property of the k-medoids algorithm is that it can be defined virtually on any data type, as long as an appropriate similarity or distance function can be defined on the data type. In this sense, k-medoids methods directly relate the problem of distance function design to clustering.


The k-medoids approach uses a generic hill-climbing strategy, in which the representative set S is initialized to a set of points from the original database D. Subsequently, this set S is iteratively improved by exchanging a single point from set S with a data point selected from the database D. This iterative exchange can be viewed as a hill-climbing strategy, because the set S implicitly defines a solution to the clustering problem, and each exchange can be viewed as a hill-climbing step. So what should be the criteria for the exchange, and when should one terminate?


Clearly, in order for the clustering algorithm to be successful, the hill-climbing approach should at least improve the objective function of the problem to some extent. Several choices arise in terms of how the exchange can be performed:





  1. One can try all |S| · |D| possibilities for replacing a representative in S with a data point in D and then select the best one. However, this is extremely expensive because the computation of the incremental objective function change for each of the |S| · |D| alternatives will require time proportional to the original database size.




  2. A simpler solution is to use a randomly selected set of r pairs (Xi, Yj) for possible exchange, where Xi is selected from the database D, and Yj is selected from the representative set S. The best of these r pairs is used for the exchange.

The second solution requires time proportional to r times the database size but is usually practically implementable for databases of modest size. The solution is said to have converged when the objective function does not improve, or if the average objective function improvement is below a user-specified threshold in the previous iteration. The k-medoids approach is generally much slower than the k-means method but has greater applicability to different data types. The next chapter will introduce the CLARANS algorithm, which is a scalable version of the k-medoids framework.
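The second (randomized) option can be sketched as follows. The code assumes a precomputed n × n pairwise distance matrix, which is one possible design choice rather than a requirement of the method, and the parameter names are illustrative.

import numpy as np

def kmedoids_random_exchange(D, k, r=10, n_iter=100, seed=0):
    # D is an n x n matrix of pairwise distances; medoids are indices into the data set.
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)

    def objective(meds):
        # Sum of distances of each point to its closest medoid.
        return D[:, meds].min(axis=1).sum()

    best = objective(medoids)
    for _ in range(n_iter):
        best_trial, best_val = None, best
        for _ in range(r):                       # evaluate r random (Xi, Yj) exchange pairs
            i, j = rng.integers(n), rng.integers(k)
            if i in medoids:
                continue                         # Xi is already a medoid; skip this pair
            trial = medoids.copy()
            trial[j] = i
            val = objective(trial)
            if val < best_val:
                best_trial, best_val = trial, val
        if best_trial is None:                   # no improving exchange found: converged
            break
        medoids, best = best_trial, best_val     # perform the best improving exchange
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels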


Practical and Implementation Issues





A number of practical issues arise in the proper implementation of all representative-based algorithms, such as the k-means, k-medians, and k-medoids algorithms. These issues relate to the initialization criteria, the choice of the number of clusters k, and the presence of outliers.


The simplest initialization criterion is to either select points randomly from the domain of the data space, or to sample the original database D. Sampling the original database D is generally superior to sampling the data space, because it leads to better statistical representatives of the underlying data. The k-representatives algorithm seems to be surprisingly robust to the choice of initialization, though it is possible for the algorithm to create suboptimal clusters. One possible solution is to sample more data points from D than the required number k, and use a more expensive hierarchical agglomerative clustering approach to create k robust centroids. Because these centroids are more representative of the database D, this provides a better starting point for the algorithm.


A very simple approach, which seems to work surprisingly well, is to select the initial representatives as centroids of m randomly chosen samples of points for some user-selected parameter m. This will ensure that the initial centroids are not too biased by any particular outlier. Furthermore, while all these centroid representatives will be approximately equal to the mean of the data, they will typically be slightly biased toward one cluster or another because of random variations across different samples. Subsequent iterations of k-means will eventually associate each of these representatives with a cluster.
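A minimal sketch of this initialization heuristic is given below, with m as the user-selected sample size; the function name is illustrative.

import numpy as np

def sample_centroid_init(X, k, m=50, seed=0):
    # Each initial representative is the centroid of m randomly chosen points,
    # so no single outlier can dominate an initial center.
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = []
    for _ in range(k):
        sample_idx = rng.choice(n, size=min(m, n), replace=False)
        centers.append(X[sample_idx].mean(axis=0))
    return np.array(centers)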


The presence of outliers will typically have a detrimental impact on such algorithms. This can happen in cases where the initialization procedure selects an outlier as one of the initial centers. Although a k-medoids algorithm will typically discard an outlier representative during an iterative exchange, a k-center approach can become stuck with a singleton cluster or an empty cluster in subsequent iterations. In such cases, one solution is to add an additional step in the iterative portion of the algorithm that discards centers with very small clusters and replaces them with randomly chosen points from the data.


The number of clusters k is a parameter used by this approach. Section 6.9.1.1 on cluster validity provides an approximate method for selecting the number of clusters k. As discussed in Sect. 6.9.1.1, this approach is far from perfect. The number of natural clusters is often difficult to determine using automated methods. Because the number of natural clusters is not known a priori, it may sometimes be desirable to use a larger value of k than the analyst’s “guess” about the true natural number of clusters in the data. This will result in the splitting of some of the data clusters into multiple representatives, but it is less likely for clusters to be incorrectly merged. As a postprocessing step, it may be possible to merge some of the clusters based on the intercluster distances. Some hybrid agglomerative and partitioning algorithms include a merging step within the k-representative procedure. Refer to the bibliographic notes for references to these algorithms.
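One simple way to realize this postprocessing step is sketched below: clusters whose centroids are closer than a user-specified threshold are merged repeatedly. The threshold and the centroid-distance criterion are illustrative choices; other intercluster distance measures could be used instead.

import numpy as np

def merge_close_clusters(X, labels, threshold):
    # Repeatedly merge the pair of clusters with the closest centroids,
    # as long as that centroid distance falls below the threshold.
    labels = labels.copy()
    while True:
        ids = np.unique(labels)
        if len(ids) < 2:
            break
        centroids = {c: X[labels == c].mean(axis=0) for c in ids}
        best_pair, best_dist = None, np.inf
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best_dist:
                    best_pair, best_dist = (a, b), d
        if best_dist > threshold:
            break
        a, b = best_pair
        labels[labels == b] = a          # merge cluster b into cluster a
    return labels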


6.4 Hierarchical Clustering Algorithms

Hierarchical algorithms typically cluster the data with distances. However, the use of distance functions is not compulsory. Many hierarchical algorithms use other clustering methods, such as density- or graph-based methods, as a subroutine for constructing the hierarchy.


So why are hierarchical clustering methods useful from an application-centric point of view? One major reason is that different levels of clustering granularity provide different application-specific insights. This provides a taxonomy of clusters, which may be browsed for semantic insights. As a specific example, consider the taxonomy² of Web pages created by the well-known Open Directory Project (ODP). In this case, the clustering has been








² http://www.dmoz.org

