reduce overlap among the different subgraphs. Different algorithms may vary on this step by using only frequent maximal subgraphs, or selecting a subset of graphs that are sufficiently nonoverlapping with one another. Create a new feature fi for each frequent subgraph Si that is discovered. Let d be the total number of frequent subgraphs (features). This is the “lexicon” size in terms of which a text-like representation will be constructed.
For each graph Gi, create a vector-space representation in terms of the features f1 . . . fd. Each graph contains the features corresponding to the subgraphs that it contains. The frequency of each feature is the number of occurrences of the corre-sponding subgraph in the graph Gi. It is also possible to use a binary representation by only considering the presence or absence of subgraphs rather than frequency of presence. Use tf-idf normalization on the vector-space representation, as discussed in Chap. 13.
Use any of the text-clustering algorithms discussed in Sect. 13.3 in Chap. 13, in order to discover clusters of newly created text objects. Map the text clusters to graph object clusters.
This broader approach of using text-based methods is utilized frequently with many con-textual data types. For example, an almost exactly analogous approach is discussed for sequence clustering in Sect. 15.3.3 of Chap. 15. This is because a modified version of fre-quent pattern mining methods can be defined for most data types. It should be pointed out that, although the substructure-based transformation is discussed here, many of the kernel-based transformations and topological descriptors, discussed earlier in this chapter, may be used as well. For example, the kernel k-means algorithm can be used in conjunction with the graph kernels discussed in this chapter.
17.5.2.2 XProj: Direct Clustering with Frequent Subgraph Discovery
The XProj algorithm derives its name from the fact that it was originally proposed for XML graphs, and a substructure can be viewed as a PROJection of the graph. Neverthe-less, the approach is not specific to XML structures, and it can be applied to any other graph domain, such as chemical compounds. The XProj algorithm uses the substructure discovery process as an important subroutine, and different applications may use different substructure discovery methods, depending on the data domain. Therefore, the following will provide a generic description of the XProj algorithm for graph clustering, although the substructure discovery process may be implemented in an application-specific way. Because the algorithm uses the frequent substructures for the clustering process, an additional input to the algorithm is the minimum support minsup. Another input to the algorithm is the size l of the frequent substructures mined. The size of the frequent substructures is fixed in order to ensure robust computation of similarity. These are user-defined parameters that can be tuned to obtain the most effective results.
The algorithm can be viewed as a representative approach similar to k-medoids, except that each representative is a set of frequent substructures. These represent the localized substructures of each group. The use of frequent-substructures as the representatives, instead of the original graphs, is crucial. This is because distances cannot be computed effectively between pairs of graphs, when the sizes of the graphs are larger. On the other hand, the membership of frequent substructures provides a more intuitive way of computing similarity. It should be pointed out that, unlike transformational methods, the frequent substructures
582 CHAPTER 17. MINING GRAPH DATA
Algorithm XProj(Graph Database: G, Minimum Support: minsup
Structural Size: l, Number of Clusters: k )
begin
Initialize clusters C1 . . . Ck randomly;
Compute frequent substructure sets F1 . . . Fk from C1 . . . Ck; repeat
Assign each graph Gj ∈ G to the cluster Ci for which the former’s similarity to Fi is the largest ∀i ∈ {1 . . . k};
Compute frequent substructure set Fi from Ci for each i ∈ {1 . . . k}; until convergence;
end
Figure 17.14: The frequent subgraph-based clustering algorithm (high level description)
are local to each cluster, and are therefore better optimized. This is the main advantage of this approach over a generic transformational approach.
There are a total of k such frequent substructure sets F1 . . . F k, and the graph database is partitioned into k groups around these localized representatives. The algorithm is initialized with a random partition of the database G into k clusters. These k clusters are denoted by C1 . . . Ck. The frequent substructures Fi of each of these clusters Ci can be determined using any frequent substructure discovery algorithm. Subsequently, each graph in Gj ∈ G is assigned to one of the representative sets Fi based on the similarity of Gj to each representative set Fi. The details of the similarity computation will be discussed later. This process is repeated iteratively, so that the representative set Fi is generated from cluster Ci, and the cluster Ci is generated from the frequent set Fi. The process is repeated, until the change in the average similarity of each graph Gj to its assigned representative set Fi is no larger than a user-defined threshold. At this point, the algorithm is assumed to have converged, and it terminates. The overall algorithm is illustrated in Fig. 17.14.
It remains to be described how the similarity between a graph Gj and a representative set Fi is computed. The similarity between Gj and Fi is computed with the use of a coverage criterion. The similarity between Gj and Fi is equal to the fraction of frequent substructures in Fi that are a subgraph of Gj .
A major computational challenge is that the determination of frequent substructures in Fi may be too expensive. Furthermore, there may be a large number of frequent sub-structures in Fi that are highly overlapping with one another. To address these issues, the XProj algorithm proposes a number of optimizations. The first optimization is that the frequent substructures do not need to be determined exactly. An approximate algorithm for frequent substructure mining is designed. The second optimization is that only a subset of nonoverlapping substructures of length l are included in the sets Fi. The details of these optimizations may be found in pointers discussed in the bibliographic notes.
17.6 Graph Classification
It is assumed that a set of n graphs G1 . . . Gn is available, but only a subset of these graphs is labeled. Among these, the first nt ≤ n graphs are labeled, and the remaining (n − nt) graphs are unlabeled. The labels are drawn from {1 . . . k}. It is desired to use the labels on the training graphs to infer the labels of unlabeled graphs.
Dostları ilə paylaş: |