Data Mining: The Textbook




$$O = \sum_{i=1}^{n} \Big[ \min_j \, Dist(X_i, Y_j) \Big] \qquad (6.4)$$



In other words, the sum of the distances of the different data points to their closest representatives needs to be minimized. Note that the assignment of data points to representatives depends on the choice of the representatives Y1 ... Yk. In some variations of representative algorithms, such as k-medoid algorithms, it is assumed that the representatives Y1 ... Yk are drawn from the original database D, although this will obviously not provide an optimal solution. In general, the discussion in this section will not automatically assume that the representatives are drawn from the original database D, unless specified otherwise.
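For concreteness, the following minimal sketch evaluates the objective of Eq. 6.4 for a fixed set of representatives, assuming NumPy arrays and the Euclidean distance as Dist (the function name is illustrative, not from the text):

```python
import numpy as np

def clustering_objective(X, Y):
    """Evaluate Eq. 6.4: the sum, over all n points, of the distance
    from each point to its closest representative.

    X: (n, d) array of data points; Y: (k, d) array of representatives.
    """
    # Pairwise Euclidean distances between every point and every representative.
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # shape (n, k)
    # The min over j picks the closest representative for each point (Eq. 6.4).
    return dists.min(axis=1).sum()
```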


One observation about the formulation of Eq. 6.4 is that the representatives Y1 ... Yk and the optimal assignment of data points to representatives are unknown a priori, but they depend on each other in a circular way. For example, if the optimal representatives are known, then the optimal assignment is easy to determine, and vice versa. Such optimization problems are solved with the use of an iterative approach where candidate representatives and candidate assignments are used to improve each other. Therefore, the generic k-representatives approach starts by initializing the k representatives S with the use of a straightforward heuristic (such as random sampling from the original data), and then refines the representatives and the clustering assignment, iteratively, as follows:





  • (Assign step) Assign each data point to its closest representative in S using distance function Dist(·, ·), and denote the corresponding clusters by C1 ... Ck.

  • (Optimize step) Determine the optimal representative Yj for each cluster Cj that minimizes its local objective function $\sum_{X_i \in C_j} Dist(X_i, Y_j)$. (A code sketch of both steps follows this list.)
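A minimal sketch of these two steps, assuming the Euclidean distance (for which the optimize step reduces to taking the mean of each cluster, as discussed below), might look as follows; the helper names are illustrative:

```python
import numpy as np

def assign_step(X, Y):
    """Assign each point in X (n x d) to its closest representative in Y (k x d).
    Returns an integer array of cluster indices of length n."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # shape (n, k)
    return dists.argmin(axis=1)

def optimize_step(X, labels, k):
    """For the Euclidean distance, the representative that minimizes each
    cluster's local objective is the mean of the points assigned to it.
    (Empty clusters are not handled in this simplified sketch.)"""
    return np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
```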


It will be evident later in this chapter that this two-step procedure is very closely related to generative models of cluster analysis in the form of expectation-maximization algorithms. The second step of local optimization is simplified by this two-step iterative approach, because it no longer depends on an unknown assignment of data points to clusters, as in the global optimization problem of Eq. 6.4. Typically, the optimized representative can be shown to be some central measure of the data points in the jth cluster Cj, and the precise measure depends on the choice of the distance function Dist(Xi, Yj). In particular, for the case of the Euclidean distance and cosine similarity functions, it can be shown that the optimal centralized representative of each cluster is its mean. However, different distance functions may lead to a slightly different type of centralized representative, and these lead to different variations of this broader approach, such as the k-means and k-medians algorithms. Thus, the k-representatives approach defines a family of algorithms, in which minor changes to the basic framework allow the use of different distance criteria. These different criteria will be discussed below.

The generic framework for representative-based algorithms with an unspecified distance function is illustrated in the pseudocode of Fig. 6.2. The idea is to improve the objective function over multiple iterations. Typically, the improvement is significant in early iterations, but it slows down in later iterations. When the improvement in the objective function in an iteration is less than a user-defined threshold, the algorithm may be allowed to terminate. The primary computational bottleneck of the approach is the assignment step, where the distances need to be computed between all point-representative pairs. The time complexity of each iteration is O(k · n · d) for a data set of size n and dimensionality d. The algorithm typically terminates in a small constant number of iterations.
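Putting the pieces together, one possible driver loop for this generic framework, reusing the sketches above and terminating when the per-iteration improvement drops below a user-defined threshold (eps, max_iter, and seed are illustrative parameters), is:

```python
def k_representatives(X, k, eps=1e-4, max_iter=100, seed=0):
    """Generic representative-based clustering loop (Euclidean-distance sketch).
    Each iteration costs O(k * n * d) due to the assign step."""
    rng = np.random.default_rng(seed)
    # Initialize representatives by random sampling from the data set.
    Y = X[rng.choice(len(X), size=k, replace=False)]
    prev_obj = np.inf
    for _ in range(max_iter):
        labels = assign_step(X, Y)        # assign points to closest representatives
        Y = optimize_step(X, labels, k)   # recompute the optimal representatives
        obj = clustering_objective(X, Y)
        if prev_obj - obj < eps:          # improvement below threshold: terminate
            break
        prev_obj = obj
    return Y, labels
```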


The inner workings of the k-representatives algorithm are illustrated with an example in Fig. 6.3, where the data contains three natural clusters, denoted by A, B, and C. For illustration, it is assumed that the input k to the algorithm is the same as the number of natural clusters in the data, which, in this case, is 3. The Euclidean distance function is used in this example.

