Data Mining: The Textbook




Algorithm GenericRepresentative(Database: D, Number of Representatives: k)
begin
    Initialize representative set S;
    repeat
        Create clusters (C1 . . . Ck) by assigning each point in D to closest
            representative in S using the distance function Dist(·, ·);
        Recreate set S by determining one representative Yj for each Cj
            that minimizes \sum_{X_i \in C_j} Dist(X_i, Y_j);
    until convergence;
    return (C1 . . . Ck);
end

Figure 6.2: Generic representative algorithm with unspecified distance function
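The following Python sketch mirrors the template of Figure 6.2. The distance function Dist(·, ·) and the per-cluster representative rule are left as pluggable callables; the random initialization and the parameter names are illustrative assumptions, since the figure leaves initialization unspecified.

    import numpy as np

    def generic_representative(D, k, dist, find_representative, max_iter=100, seed=0):
        """Sketch of the GenericRepresentative template of Figure 6.2."""
        rng = np.random.default_rng(seed)
        # Initialize the representative set S with k randomly chosen data points
        # (an assumption; the figure leaves the initialization step open).
        S = D[rng.choice(len(D), size=k, replace=False)]
        assignment = None
        for _ in range(max_iter):
            # Assign each point in D to its closest representative in S.
            new_assignment = np.array([np.argmin([dist(x, y) for y in S]) for x in D])
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break  # convergence: the assignment no longer changes
            assignment = new_assignment
            # Recreate S: one representative Y_j per cluster C_j.
            new_S = []
            for j in range(k):
                members = D[assignment == j]
                # Keep the old representative if a cluster happens to become empty.
                new_S.append(find_representative(members) if len(members) else S[j])
            S = np.array(new_S)
        return assignment, S

    # Example instantiation: Euclidean distance with the mean as representative,
    # which recovers the familiar k-means special case.
    labels, reps = generic_representative(
        np.random.randn(100, 2), k=3,
        dist=lambda a, b: np.linalg.norm(a - b),
        find_representative=lambda pts: pts.mean(axis=0),
    )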




Fisher score (cf. Sect. 10.2 of Chap. 10). The Fisher score, discussed in Sect. 10.2.1.3 of Chap. 10, measures the ratio of the intercluster variance to the intracluster variance on any particular attribute. Furthermore, it is possible to apply this two-step procedure iteratively. However, some modifications to the first step are required. Instead of selecting the top-k features, the weights of the top-k features are set to 1, and the remainder are set to α < 1. Here, α is a user-specified parameter. In the final step, the top-k features are selected.
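A minimal sketch of this iterative weighting scheme is shown below. It assumes Euclidean k-means (via scikit-learn) as the clustering subroutine and a simplified per-attribute Fisher score computed as the ratio of intercluster to intracluster variance; the function names and the default value of α are illustrative, not prescribed by the text.

    import numpy as np
    from sklearn.cluster import KMeans

    def fisher_scores(X, labels):
        """Simplified per-attribute Fisher score: intercluster variance
        divided by intracluster variance, from the current cluster labels."""
        overall_mean = X.mean(axis=0)
        inter = np.zeros(X.shape[1])
        intra = np.zeros(X.shape[1])
        for c in np.unique(labels):
            Xc = X[labels == c]
            frac = len(Xc) / len(X)
            inter += frac * (Xc.mean(axis=0) - overall_mean) ** 2
            intra += frac * Xc.var(axis=0)
        return inter / (intra + 1e-12)

    def iterative_feature_weighting(X, n_clusters, top_k, alpha=0.5, n_iter=5):
        """Alternate between weighted clustering and Fisher-score-based
        reweighting; finally select the top-k features."""
        weights = np.ones(X.shape[1])
        for _ in range(n_iter):
            # Scaling each attribute by sqrt(weight) makes Euclidean k-means
            # behave like clustering with a weighted distance function.
            labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
                X * np.sqrt(weights)
            )
            scores = fisher_scores(X, labels)
            top = np.argsort(scores)[-top_k:]
            weights = np.full(X.shape[1], alpha)
            weights[top] = 1.0   # top-k features get weight 1, the rest alpha
        return np.sort(top)      # indices of the finally selected features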

Wrapper models are often combined with filter models to create hybrid models for better efficiency. In this case, candidate feature subsets are constructed with the use of filter models. Then, the quality of each candidate feature subset is evaluated with a clustering algorithm. The evaluation can be performed either with a cluster validity criterion or with the use of a classification algorithm on the resulting cluster labels. The best candidate feature subset is selected. Hybrid models provide better accuracy than filter models and are more efficient than wrapper models.
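One possible realization of such a hybrid scheme is sketched below, with illustrative choices throughout: attribute variance stands in for the filter criterion, nested subsets of the top-ranked attributes serve as candidates, and each candidate is scored with k-means followed by the silhouette coefficient as the cluster validity criterion.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def hybrid_feature_selection(X, n_clusters, candidate_sizes=(2, 4, 8)):
        """Filter step: rank attributes and build nested candidate subsets.
        Wrapper step: evaluate each candidate with a clustering algorithm
        plus a validity criterion, and keep the best-scoring subset."""
        ranking = np.argsort(X.var(axis=0))[::-1]      # cheap filter ranking
        best_subset, best_score = None, -np.inf
        for m in candidate_sizes:
            subset = ranking[:m]
            labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, subset])
            score = silhouette_score(X[:, subset], labels)   # cluster validity
            if score > best_score:
                best_subset, best_score = subset, score
        return best_subset, best_score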


6.3 Representative-Based Algorithms


Representative-based algorithms are the simplest of all clustering algorithms because they rely directly on intuitive notions of distance (or similarity) to cluster data points. In representative-based algorithms, the clusters are created in one shot, and hierarchical relationships do not exist among different clusters. This is typically done with the use of a set of partitioning representatives. The partitioning representatives may either be created as a function of the data points in the clusters (e.g., the mean) or may be selected from the existing data points in the cluster. The main insight of these methods is that the discovery of high-quality clusters in the data is equivalent to discovering a high-quality set of representatives. Once the representatives have been determined, a distance function can be used to assign the data points to their closest representatives.


Typically, it is assumed that the number of clusters, denoted by k, is specified by the user. Consider a data set D containing n data points denoted by X1 . . . Xn in d-dimensional space. The goal is to determine k representatives Y1 . . . Yk that minimize the following objective function O:

O = \sum_{i=1}^{n} \left[ \min_{j} \mathrm{Dist}(X_i, Y_j) \right]
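As a small illustration, the objective O can be evaluated directly for any candidate set of representatives; the sketch below assumes Euclidean distance as Dist(·, ·).

    import numpy as np

    def objective(D, S):
        """Value of O: the sum, over all n points in D, of the distance to
        the closest of the k representatives in S (Euclidean distance assumed)."""
        dists = np.linalg.norm(D[:, None, :] - S[None, :, :], axis=2)  # shape (n, k)
        return dists.min(axis=1).sum()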




