Data Mining: The Textbook





  1. Discover frequent subgraphs in the graph database, and reduce overlap among the different subgraphs. Different algorithms may vary on this step by using only frequent maximal subgraphs, or by selecting a subset of subgraphs that are sufficiently nonoverlapping with one another. Create a new feature fi for each frequent subgraph Si that is discovered. Let d be the total number of frequent subgraphs (features). This is the “lexicon” size in terms of which a text-like representation will be constructed.





  2. For each graph Gi, create a vector-space representation in terms of the features f1 . . . fd. Each graph contains the features corresponding to the subgraphs that it contains. The frequency of each feature is the number of occurrences of the corresponding subgraph in the graph Gi. It is also possible to use a binary representation by considering only the presence or absence of subgraphs rather than their frequency. Use tf-idf normalization on the vector-space representation, as discussed in Chap. 13.




  3. Use any of the text-clustering algorithms discussed in Sect. 13.3 of Chap. 13 to discover clusters of the newly created text objects. Map the text clusters back to clusters of graph objects, as sketched below.
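As a concrete illustration of these steps, the following is a minimal Python sketch. It assumes that the frequent subgraph mining step has already produced an occurrence-count matrix (one row per graph, one column per frequent subgraph); the use of scikit-learn's TfidfTransformer and KMeans for the tf-idf normalization and text-clustering steps is one possible implementation choice, not the specific one prescribed by the text.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.cluster import KMeans

    def cluster_graphs_as_text(counts, k):
        """counts: (n_graphs, d) array of frequent-subgraph occurrence counts."""
        # tf-idf normalization of the "document-term" matrix (as in Chap. 13)
        tfidf = TfidfTransformer().fit_transform(counts)
        # any text-clustering method may be used; k-means is one simple choice
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)
        return labels

    # Hypothetical example: 4 graphs described by 3 frequent-subgraph features
    counts = np.array([[2, 0, 1],
                       [3, 0, 0],
                       [0, 4, 1],
                       [0, 5, 2]])
    print(cluster_graphs_as_text(counts, k=2))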

This broader approach of using text-based methods is utilized frequently with many contextual data types. For example, an almost exactly analogous approach is discussed for sequence clustering in Sect. 15.3.3 of Chap. 15. This is because a modified version of frequent pattern mining methods can be defined for most data types. It should be pointed out that, although the substructure-based transformation is discussed here, many of the kernel-based transformations and topological descriptors discussed earlier in this chapter may be used as well. For example, the kernel k-means algorithm can be used in conjunction with the graph kernels discussed in this chapter.


17.5.2.2 XProj: Direct Clustering with Frequent Subgraph Discovery


The XProj algorithm derives its name from the fact that it was originally proposed for XML graphs, and a substructure can be viewed as a PROJection of the graph. Nevertheless, the approach is not specific to XML structures, and it can be applied to any other graph domain, such as chemical compounds. The XProj algorithm uses the substructure discovery process as an important subroutine, and different applications may use different substructure discovery methods, depending on the data domain. Therefore, the following provides a generic description of the XProj algorithm for graph clustering, although the substructure discovery process may be implemented in an application-specific way. Because the algorithm uses frequent substructures for the clustering process, an additional input to the algorithm is the minimum support minsup. Another input to the algorithm is the size l of the frequent substructures mined. The size of the frequent substructures is fixed in order to ensure robust computation of similarity. These are user-defined parameters that can be tuned to obtain the most effective results.


The algorithm can be viewed as a representative approach similar to k-medoids, except that each representative is a set of frequent substructures. These represent the localized substructures of each group. The use of frequent substructures as the representatives, instead of the original graphs, is crucial. This is because distances cannot be computed effectively between pairs of graphs when the sizes of the graphs are large. On the other hand, membership of frequent substructures provides a more intuitive way of computing similarity. It should be pointed out that, unlike transformational methods, the frequent substructures are local to each cluster, and are therefore better optimized. This is the main advantage of this approach over a generic transformational approach.




Algorithm XProj(Graph Database: G, Minimum Support: minsup,
                Structural Size: l, Number of Clusters: k)
begin
    Initialize clusters C1 . . . Ck randomly;
    Compute frequent substructure sets F1 . . . Fk from C1 . . . Ck;
    repeat
        Assign each graph Gj ∈ G to the cluster Ci for which the
            similarity of Gj to Fi is the largest, ∀i ∈ {1 . . . k};
        Compute frequent substructure set Fi from Ci for each i ∈ {1 . . . k};
    until convergence;
end

Figure 17.14: The frequent subgraph-based clustering algorithm (high-level description)

There are a total of k such frequent substructure sets F1 . . . Fk, and the graph database is partitioned into k groups around these localized representatives. The algorithm is initialized with a random partition of the database G into k clusters. These k clusters are denoted by C1 . . . Ck. The frequent substructures Fi of each of these clusters Ci can be determined using any frequent substructure discovery algorithm. Subsequently, each graph Gj ∈ G is assigned to one of the representative sets Fi based on the similarity of Gj to each representative set Fi. The details of the similarity computation will be discussed later. This process is repeated iteratively, so that the representative set Fi is generated from the cluster Ci, and the cluster Ci is generated from the frequent set Fi. The process is repeated until the change in the average similarity of each graph Gj to its assigned representative set Fi is no larger than a user-defined threshold. At this point, the algorithm is assumed to have converged, and it terminates. The overall algorithm is illustrated in Fig. 17.14.


It remains to describe how the similarity between a graph Gj and a representative set Fi is computed. The similarity between Gj and Fi is computed with the use of a coverage criterion: it is equal to the fraction of frequent substructures in Fi that are subgraphs of Gj.
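In code, the coverage criterion amounts to the following sketch. The subgraph-isomorphism test is_subgraph(s, G) is assumed to be supplied by the application domain (such a test is itself computationally hard in general); the function names here are hypothetical.

    def coverage_similarity(G, F, is_subgraph):
        """Coverage criterion: fraction of frequent substructures in the
        representative set F that occur as subgraphs of the graph G."""
        if not F:
            return 0.0
        # is_subgraph(s, G) is an assumed, domain-specific subgraph test
        return sum(1 for s in F if is_subgraph(s, G)) / len(F)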


A major computational challenge is that the determination of the frequent substructures in Fi may be too expensive. Furthermore, there may be a large number of frequent substructures in Fi that are highly overlapping with one another. To address these issues, the XProj algorithm proposes a number of optimizations. The first optimization is that the frequent substructures do not need to be determined exactly; an approximate algorithm for frequent substructure mining is designed. The second optimization is that only a subset of nonoverlapping substructures of length l are included in the sets Fi. The details of these optimizations may be found in the pointers discussed in the bibliographic notes.
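Putting the pieces together, the main loop of Fig. 17.14 may be sketched as follows. The routine mine_substructures, which returns (approximately) frequent substructures of size l at support minsup from a cluster of graphs, and the subgraph test is_subgraph are treated as black boxes supplied by the application; coverage_similarity is the coverage-criterion function sketched above, and the convergence test based on the change in average similarity follows the description given earlier. This is an illustrative sketch, not the exact implementation referenced in the bibliographic notes.

    import random

    def xproj_cluster(graphs, k, minsup, l, mine_substructures, is_subgraph,
                      eps=1e-3, max_iter=20):
        """High-level sketch of the XProj-style iterative clustering loop.
        mine_substructures(cluster, minsup, l) -> list of size-l substructures;
        is_subgraph(s, G) -> bool; both are assumed, domain-specific routines."""
        # random initial partition of the database into k clusters
        assignment = [random.randrange(k) for _ in graphs]
        prev_avg = -1.0
        for _ in range(max_iter):
            clusters = [[g for g, c in zip(graphs, assignment) if c == i]
                        for i in range(k)]
            # localized representatives: frequent substructures of each cluster
            F = [mine_substructures(clusters[i], minsup, l) for i in range(k)]
            sims = []
            for j, g in enumerate(graphs):
                # assign each graph to the representative set it covers best
                scores = [coverage_similarity(g, F[i], is_subgraph) for i in range(k)]
                assignment[j] = max(range(k), key=lambda i: scores[i])
                sims.append(scores[assignment[j]])
            avg = sum(sims) / len(sims)
            # convergence: average similarity changes by no more than a threshold
            if abs(avg - prev_avg) <= eps:
                break
            prev_avg = avg
        return assignment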


17.6 Graph Classification


It is assumed that a set of n graphs G1 . . . Gn is available, but only a subset of these graphs is labeled. Among these, the first nt ≤ n graphs are labeled, and the remaining (n − nt) graphs are unlabeled. The labels are drawn from {1 . . . k}. It is desired to use the labels on the training graphs to infer the labels of unlabeled graphs.



