Data Mining: The Textbook


Page	101/423
Date	07.01.2024
Size	17.13 MB



6.3. REPRESENTATIVE-BASED ALGORITHMS


Figure 6.3: Illustration of k-representative algorithm with random initialization

CHAPTER 6. CLUSTER ANALYSIS

is used, and therefore the “re-centering” step uses the mean of the cluster. The initial set of representatives (or seeds) is chosen randomly from the data space. This leads to a particularly bad initialization, where two of the representatives are close to cluster B, and one of them lies somewhere midway between clusters A and C. As a result, cluster B is initially split up by the “sphere of influence” of two representatives, whereas most of the points in clusters A and C are assigned to a single representative in the first assignment step. This situation is illustrated in Fig. 6.3a.
However, because each representative is assigned a different number of data points from the different clusters, each representative drifts in subsequent iterations toward a distinct cluster. For example, representative 1 steadily drifts toward cluster A, and representative 3 steadily drifts toward cluster C. At the same time, representative 2 becomes a better centralized representative of cluster B. As a result, cluster B is no longer split up among different representatives by the end of iteration 10 (Fig. 6.3f).
An interesting observation is that even though the initialization was so poor, it required only 10 iterations for the k-representatives approach to create a reasonable clustering of the data. In practice, this is generally true of k-representative methods, which converge relatively fast toward a good clustering of the data points. However, it is possible for k-means to converge to suboptimal solutions, especially when an outlier data point is selected as an initial representative for the algorithm. In such a case, one of the clusters may contain a singleton point that is not representative of the data set, or it may contain two merged clusters. The handling of such cases is discussed in the section on implementation issues. In the following section, some special cases and variations of this framework will be discussed.
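The alternation between the assignment step and the mean-based “re-centering” step described above can be sketched as follows. This is a minimal illustration rather than the book's implementation: the function name `k_representatives_mean`, the choice of drawing seeds uniformly from the bounding box of the data, and the rule of leaving an empty cluster's representative in place are all assumptions made for this example.

```python
import numpy as np

def k_representatives_mean(X, k, n_iter=10, seed=0):
    """Alternate assignment and re-centering, using the cluster mean as representative."""
    rng = np.random.default_rng(seed)
    # Assumption: seeds are drawn uniformly from the bounding box of the data
    # (the "data space"), not from the data points themselves.
    lo, hi = X.min(axis=0), X.max(axis=0)
    reps = rng.uniform(lo, hi, size=(k, X.shape[1]))
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest representative
        # under the squared Euclidean distance.
        d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Re-centering step: move each representative to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old representative
                reps[j] = members.mean(axis=0)
    # Final assignment so that the returned labels match the returned representatives.
    d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    return reps, d2.argmin(axis=1)
```

Calling `k_representatives_mean(X, 3, n_iter=10)` on a two-dimensional data set with three clusters mirrors the setting of Fig. 6.3. Because both steps can only lower the sum of squared distances, running more iterations never makes that sum worse, although the result may still be a suboptimal clustering.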
Most of the variations of the k-representative framework are defined by the choice of the distance function Dist(X_i, Y_j) between the data points X_i and the representatives Y_j. Each of these choices results in a different type of centralized representative of a cluster.
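To see why the distance function determines the representative, the sketch below compares two candidate representatives of a single small cluster under two distance functions. The squared Euclidean distance is the one used by k-means. The Manhattan distance, the helper names, and the toy data are assumptions introduced only for this illustration. The mean minimizes the summed squared Euclidean distance, whereas the coordinate-wise median minimizes the summed Manhattan distance.

```python
import numpy as np

def cost(points, rep, dist):
    """Sum of dist(point, rep) over all points of one cluster."""
    return sum(dist(p, rep) for p in points)

def squared_euclidean(x, y):
    return float(((x - y) ** 2).sum())

def manhattan(x, y):
    return float(np.abs(x - y).sum())

# Hypothetical toy cluster with one outlier in the first coordinate.
cluster = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [10.0, 2.0], [4.0, 2.0]])

mean_rep = cluster.mean(axis=0)          # candidate representative: the mean
median_rep = np.median(cluster, axis=0)  # candidate representative: coordinate-wise median

for name, dist in [("squared Euclidean", squared_euclidean), ("Manhattan", manhattan)]:
    print(name,
          "| cost at mean:", cost(cluster, mean_rep, dist),
          "| cost at median:", cost(cluster, median_rep, dist))
```

On this data the mean gives the lower cost under the squared Euclidean distance and the median gives the lower cost under the Manhattan distance, so the "best" central point of the same cluster changes with the distance function.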

6.3.1 The k-Means Algorithm

In the k-means algorithm, the sum of the squares of the Euclidean distances of data points to their closest representatives is used to quantify the objective function of the clustering. Therefore, we have:
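The displayed equation is not reproduced in this excerpt. Based on the description above, the distance function and the resulting objective would presumably take the following form. This is a reconstruction: the symbols n (number of data points), k (number of representatives), and O (objective value) are assumed notation.

```latex
\[
  \mathrm{Dist}(X_i, Y_j) = \lVert X_i - Y_j \rVert_2^2
\]
\[
  O = \sum_{i=1}^{n} \min_{1 \le j \le k} \mathrm{Dist}(X_i, Y_j)
    = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert X_i - Y_j \rVert_2^2
\]
```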
