Data Mining: The Textbook




Algorithm GenericRepresentative(Database: D, Number of Representatives: k)
begin
    Initialize representative set S;
    repeat
        Create clusters (C1 . . . Ck) by assigning each point in D to closest
            representative in S using the distance function Dist(·, ·);
        Recreate set S by determining one representative Yj for each Cj
            that minimizes \sum_{X_i \in C_j} Dist(X_i, Y_j);
    until convergence;
    return (C1 . . . Ck);
end

Figure 6.2: Generic representative algorithm with unspecified distance function
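The following Python sketch mirrors the template of Figure 6.2. The distance function Dist(·, ·) and the per-cluster representative rule are left as pluggable callables; the random initialization and the parameter names are illustrative assumptions, since the figure leaves initialization unspecified.

    import numpy as np

    def generic_representative(D, k, dist, find_representative, max_iter=100, seed=0):
        """Sketch of the GenericRepresentative template of Figure 6.2."""
        rng = np.random.default_rng(seed)
        # Initialize the representative set S with k randomly chosen data points
        # (an assumption; the figure leaves the initialization step open).
        S = D[rng.choice(len(D), size=k, replace=False)]
        assignment = None
        for _ in range(max_iter):
            # Assign each point in D to its closest representative in S.
            new_assignment = np.array([np.argmin([dist(x, y) for y in S]) for x in D])
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break  # convergence: the assignment no longer changes
            assignment = new_assignment
            # Recreate S: one representative Y_j per cluster C_j.
            new_S = []
            for j in range(k):
                members = D[assignment == j]
                # Keep the old representative if a cluster happens to become empty.
                new_S.append(find_representative(members) if len(members) else S[j])
            S = np.array(new_S)
        return assignment, S

    # Example instantiation: Euclidean distance with the mean as representative,
    # which recovers the familiar k-means special case.
    labels, reps = generic_representative(
        np.random.randn(100, 2), k=3,
        dist=lambda a, b: np.linalg.norm(a - b),
        find_representative=lambda pts: pts.mean(axis=0),
    )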




Fisher score (cf. Sect. 10.2 of Chap. 10). The Fisher score, discussed in Sect. 10.2.1.3 of Chap. 10, measures the ratio of the intercluster variance to the intracluster variance on any particular attribute. Furthermore, it is possible to apply this two-step procedure iteratively. However, some modifications to the first step are required. Instead of selecting the top-k features, the weights of the top-k features are set to 1, and the remainder are set to α < 1. Here, α is a user-specified parameter. In the final step, the top-k features are selected.
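A minimal sketch of this iterative weighting scheme is shown below. It assumes Euclidean k-means (via scikit-learn) as the clustering subroutine and a simplified per-attribute Fisher score computed as the ratio of intercluster to intracluster variance; the function names and the default value of α are illustrative, not prescribed by the text.

    import numpy as np
    from sklearn.cluster import KMeans

    def fisher_scores(X, labels):
        """Simplified per-attribute Fisher score: intercluster variance
        divided by intracluster variance, from the current cluster labels."""
        overall_mean = X.mean(axis=0)
        inter = np.zeros(X.shape[1])
        intra = np.zeros(X.shape[1])
        for c in np.unique(labels):
            Xc = X[labels == c]
            frac = len(Xc) / len(X)
            inter += frac * (Xc.mean(axis=0) - overall_mean) ** 2
            intra += frac * Xc.var(axis=0)
        return inter / (intra + 1e-12)

    def iterative_feature_weighting(X, n_clusters, top_k, alpha=0.5, n_iter=5):
        """Alternate between weighted clustering and Fisher-score-based
        reweighting; finally select the top-k features."""
        weights = np.ones(X.shape[1])
        for _ in range(n_iter):
            # Scaling each attribute by sqrt(weight) makes Euclidean k-means
            # behave like clustering with a weighted distance function.
            labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
                X * np.sqrt(weights)
            )
            scores = fisher_scores(X, labels)
            top = np.argsort(scores)[-top_k:]
            weights = np.full(X.shape[1], alpha)
            weights[top] = 1.0   # top-k features get weight 1, the rest alpha
        return np.sort(top)      # indices of the finally selected features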

Wrapper models are often combined with filter models to create hybrid models for better efficiency. In this case, candidate feature subsets are constructed with the use of filter models. Then, the quality of each candidate feature subset is evaluated with a clustering algorithm. The evaluation can be performed either with a cluster validity criterion or with the use of a classification algorithm on the resulting cluster labels. The best candidate feature subset is selected. Hybrid models provide better accuracy than filter models and are more efficient than wrapper models.
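One possible realization of such a hybrid scheme is sketched below, with illustrative choices throughout: attribute variance stands in for the filter criterion, nested subsets of the top-ranked attributes serve as candidates, and each candidate is scored with k-means followed by the silhouette coefficient as the cluster validity criterion.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def hybrid_feature_selection(X, n_clusters, candidate_sizes=(2, 4, 8)):
        """Filter step: rank attributes and build nested candidate subsets.
        Wrapper step: evaluate each candidate with a clustering algorithm
        plus a validity criterion, and keep the best-scoring subset."""
        ranking = np.argsort(X.var(axis=0))[::-1]      # cheap filter ranking
        best_subset, best_score = None, -np.inf
        for m in candidate_sizes:
            subset = ranking[:m]
            labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, subset])
            score = silhouette_score(X[:, subset], labels)   # cluster validity
            if score > best_score:
                best_subset, best_score = subset, score
        return best_subset, best_score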


6.3 Representative-Based Algorithms


Representative-based algorithms are the simplest of all clustering algorithms because they rely directly on intuitive notions of distance (or similarity) to cluster data points. In representative-based algorithms, the clusters are created in one shot, and hierarchical relationships do not exist among different clusters. This is typically done with the use of a set of partitioning representatives. The partitioning representatives may either be created as a function of the data points in the clusters (e.g., the mean) or may be selected from the existing data points in the cluster. The main insight of these methods is that the discovery of high-quality clusters in the data is equivalent to discovering a high-quality set of representatives. Once the representatives have been determined, a distance function can be used to assign the data points to their closest representatives.


Typically, it is assumed that the number of clusters, denoted by k, is specified by the user. Consider a data set D containing n data points denoted by X1 . . . Xn in d-dimensional space. The goal is to determine k representatives Y1 . . . Yk that minimize the following objective function O:

O = \sum_{i=1}^{n} \left[ \min_{j} \mathrm{Dist}(X_i, Y_j) \right]
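As a small illustration, the objective O can be evaluated directly for any candidate set of representatives; the sketch below assumes Euclidean distance as Dist(·, ·).

    import numpy as np

    def objective(D, S):
        """Value of O: the sum, over all n points in D, of the distance to
        the closest of the k representatives in S (Euclidean distance assumed)."""
        dists = np.linalg.norm(D[:, None, :] - S[None, :, :], axis=2)  # shape (n, k)
        return dists.min(axis=1).sum()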




