Data Mining: The Textbook




6.9. CLUSTER VALIDATION



























                     Cluster Indices
                    1      2      3      4
Ground-truth 1     97      0      2      1
Ground-truth 2      5    191      1      3
Ground-truth 3      4      3     87      6
Ground-truth 4      0      0      5    195

Figure 6.25: Confusion matrix for a clustering of good quality

                     Cluster Indices
                    1      2      3      4
Ground-truth 1     33     30     17     20
Ground-truth 2     51    101     24     24
Ground-truth 3     24     23     31     22
Ground-truth 4     46     40     44     70

Figure 6.26: Confusion matrix for a clustering of poor quality





are still preferable to internal methods because they can usually avoid consistent bias in evaluations, when used over multiple data sets. In the following discussion, the term “class labels” will be used interchangeably to refer to either cluster identifiers in a synthetic data set or class labels in a real data set.

One of the problems is that the number of natural clusters in the data may not reflect the number of class labels (or cluster identifiers). The number of class labels is denoted by kt, which represents the true or ground-truth number of clusters. The number of clusters determined by the algorithm is denoted by kd. In some settings, kt is equal to kd, though this is often not the case. In cases where kd = kt, it is particularly helpful to create a confusion matrix, which relates the mapping of the true clusters to those determined by the algorithm. Each row i corresponds to the class label (ground-truth cluster) i, and each column j corresponds to the points in algorithm-determined cluster j. Therefore, the (i, j)th entry of this matrix is equal to the number of data points in true cluster i that are mapped to algorithm-determined cluster j. The sum of the values across a particular row i will always be the same across different clustering algorithms because it reflects the size of ground-truth cluster i in the data set.
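The confusion matrix can be computed directly from the two label vectors. The following Python sketch is illustrative rather than part of the text; it assumes that true_labels and cluster_labels are sequences of integer identifiers starting at 1, as in Figs. 6.25 and 6.26.

import numpy as np

def confusion_matrix(true_labels, cluster_labels, k_t, k_d):
    # Entry (i, j) counts the points of ground-truth cluster i+1 that
    # the algorithm placed in cluster j+1.
    M = np.zeros((k_t, k_d), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        M[t - 1, c - 1] += 1
    return M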


When the clustering is of high quality, it is usually possible to permute the rows and columns of this confusion matrix, so that only the diagonal entries are large. On the other hand, when the clustering is of poor quality, the entries across the matrix will be more evenly distributed. Two examples of confusion matrices are illustrated in Figs. 6.25 and 6.26, respectively. The first clustering is obviously of much better quality than the second.
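The text does not prescribe how to find such a permutation; one common choice (an assumption here, not part of the original discussion) is to cast it as a maximum-weight matching and solve it with the Hungarian algorithm, available as scipy.optimize.linear_sum_assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_columns(confusion):
    # Reorder the columns of a square confusion matrix so that the sum of
    # the diagonal entries is maximized (maximum-weight matching).
    # linear_sum_assignment minimizes cost, so the counts are negated.
    row_ind, col_ind = linear_sum_assignment(-confusion)
    return confusion[:, col_ind]

good = np.array([[97, 0, 2, 1],
                 [5, 191, 1, 3],
                 [4, 3, 87, 6],
                 [0, 0, 5, 195]])
print(np.trace(align_columns(good)))  # 570 of the 600 points lie on the diagonal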


The confusion matrix provides an intuitive method to visually assess the clustering. However, for larger confusion matrices, this may not be a practical solution. Furthermore, while confusion matrices can also be created for cases where kd ≠ kt, it is much harder to assess the quality of a particular clustering by visual inspection. Therefore, it is important to design hard measures to evaluate the overall quality of the confusion matrix. Two commonly used measures are the cluster purity and the class-based Gini index. Let mij represent the number of data points from class (ground-truth cluster) i that are mapped to (algorithm-determined) cluster j. Here, i is drawn from the range [1, kt], and j is drawn from the range [1, kd]. Also assume that the number of data points in true cluster i is denoted by Ni, and the number of data points in algorithm-determined cluster j is denoted by Mj. Therefore, the numbers of data points in the different clusters can be related as follows:








N_i = \sum_{j=1}^{k_d} m_{ij} \qquad \forall i = 1 \ldots k_t \qquad (6.46)

M_j = \sum_{i=1}^{k_t} m_{ij} \qquad \forall j = 1 \ldots k_d \qquad (6.47)
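In terms of the confusion matrix, Ni and Mj are simply its row and column sums. A minimal sketch of this bookkeeping, using the matrix of Fig. 6.25 (the variable names are illustrative):

import numpy as np

confusion = np.array([[97, 0, 2, 1],
                      [5, 191, 1, 3],
                      [4, 3, 87, 6],
                      [0, 0, 5, 195]])

N = confusion.sum(axis=1)  # N_i of Eq. 6.46: sizes of the ground-truth clusters
M = confusion.sum(axis=0)  # M_j of Eq. 6.47: sizes of the algorithm-determined clusters
print(N)  # [100 200 100 200]
print(M)  # [106 194  95 205]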


A high-quality algorithm-determined cluster j should contain data points that are largely dominated by a single class. Therefore, for a given algorithm-determined cluster j, the number of data points Pj in its dominant class is equal to the maximum of the values of mij over different values of ground truth cluster i:





P_j = \max_i m_{ij} \qquad (6.48)

Note that Pj ≤ Mj always holds; a high-quality clustering will result in values of Pj that are very close to Mj. The overall purity is then given by the following:








\text{Purity} = \frac{\sum_{j=1}^{k_d} P_j}{\sum_{j=1}^{k_d} M_j} \qquad (6.49)







High values of the purity are desirable. The cluster purity can be computed in two different ways. The method discussed above computes the purity of each algorithm-determined cluster (with respect to the ground-truth clusters) and then computes the aggregate purity on this basis. The second way computes the purity of each ground-truth cluster with respect to the algorithm-determined clusters. The two methods will not lead to the same results, especially when the values of kd and kt are significantly different. The mean of the two values may also be used as a single measure in such cases. The first of these measures, given by Eq. 6.49, is the easiest to interpret intuitively, and it is therefore the most popular.
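The following sketch computes both variants of the purity from a confusion matrix; it is an illustration under the assumptions above (numpy arrays, rows as ground-truth clusters), not code from the text.

import numpy as np

def purity(confusion, axis=0):
    # axis=0: column-wise purity of Eq. 6.49, i.e. the purity of each
    # algorithm-determined cluster; axis=1: the row-wise variant.
    # The maxima are the P_j values of Eq. 6.48, and the denominator
    # is the total number of points.
    return confusion.max(axis=axis).sum() / confusion.sum()

good = np.array([[97, 0, 2, 1],
                 [5, 191, 1, 3],
                 [4, 3, 87, 6],
                 [0, 0, 5, 195]])
poor = np.array([[33, 30, 17, 20],
                 [51, 101, 24, 24],
                 [24, 23, 31, 22],
                 [46, 40, 44, 70]])
print(purity(good))  # 0.95        (570 / 600, Fig. 6.25)
print(purity(poor))  # about 0.44  (266 / 600, Fig. 6.26)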

One of the major problems with the purity-based measure is that it accounts only for the dominant label in the cluster and ignores the distribution of the remaining points. For example, a cluster whose data points are predominantly drawn from two classes is better than one in which the data points belong to many different classes, even if the cluster purity is the same. To account for the variation across the different classes, the Gini index may be used. This measure is closely related to the notion of entropy, and it measures the level of inequality (or confusion) in the distribution of the entries in a row (or column) of the confusion matrix. As in the case of the purity measure, it can be computed with a row-wise method or a column-wise method, and the two will evaluate to different values. Here, the column-wise method is described. The Gini index Gj for column (algorithm-determined cluster) j is defined as follows:





G_j = 1 - \sum_{i=1}^{k_t} \left( \frac{m_{ij}}{M_j} \right)^2 \qquad (6.50)
















The value of Gj will be close to 0 when the entries in a column of the confusion matrix are skewed, as in the case of Fig. 6.25. When the entries are evenly distributed, the value will be close to 1 − 1/kt, which is also the upper bound on this value. The average Gini coefficient is the weighted average of these different column-wise values, where the weight of Gj is Mj:






G_{average} = \frac{\sum_{j=1}^{k_d} G_j \cdot M_j}{\sum_{j=1}^{k_d} M_j} \qquad (6.51)
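A minimal sketch of Eqs. 6.50 and 6.51, again assuming a numpy confusion matrix with ground-truth clusters as rows; lower values indicate a better clustering.

import numpy as np

def average_gini(confusion):
    # Column-wise Gini index G_j of Eq. 6.50, aggregated into the
    # M_j-weighted average of Eq. 6.51.
    M = confusion.sum(axis=0)          # cluster sizes M_j
    p = confusion / M                  # fractions m_ij / M_j, column by column
    G = 1.0 - (p ** 2).sum(axis=0)     # G_j for each algorithm-determined cluster
    return (G * M).sum() / M.sum()

good = np.array([[97, 0, 2, 1],
                 [5, 191, 1, 3],
                 [4, 3, 87, 6],
                 [0, 0, 5, 195]])
poor = np.array([[33, 30, 17, 20],
                 [51, 101, 24, 24],
                 [24, 23, 31, 22],
                 [46, 40, 44, 70]])
print(average_gini(good))  # small, since each column is dominated by one class
print(average_gini(poor))  # close to the upper bound 1 - 1/k_t = 0.75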













