CARBOXYL GROUP PHENOL GROUP
FREQUENT SUBSTRUCTURES OF PHENOLIC ACIDS
Figure 17.10: Examples of frequent substructures in a database of phenolic acids
either node extensions or edge extensions.
Subsequently, the precise changes required to enable these two specific variations will be discussed.
The overall algorithm for frequent subgraph mining is illustrated in Fig. 17.11. The input to the algorithm is the graph database G = {G1 . . . Gn} and a minimum support value minsup. The basic algorithm structure is similar to that of the Apriori algorithm, discussed in Fig. 4.2 of Chap. 4. A levelwise algorithm is used, in which candidate subgraphs Ck+1 of size (k + 1) are generated by using joins on graph pairs from the set of frequent subgraphs Fk of size k. As discussed earlier, the size of a subgraph may refer to either its nodes or edges, depending on the specific algorithm used. The two graphs need to be matching in a subgraph of size (k − 1) for a join to be successfully performed. The resulting candidate subgraph will be of size ( k + 1). Therefore, one of the important steps of join processing, is determining whether two graphs share a subgraph of size (k − 1) in common. The matching algorithms discussed in Sect. 17.2 can be used for this purpose. In some applications, where node labels are distinct and isomorphism is not an issue, this step can be performed very efficiently. On the other hand, for large graphs that have many repeating node labels, this step is slow because of isomorphism.
After the pairs of matching graphs have been identified, joins are performed on them in order to generate the candidates Ck +1 of size (k + 1). The different node-based and edge-based variations in the methods for performing joins will be described later. Furthermore, the Apriori pruning trick is used. Candidates in Ck+1 that are such that any of their k-subgraphs do not exist in Fk are pruned. For each remaining candidate subgraph, the support is computed with respect to the graph database G. The subgraph isomorphism algorithm discussed in Sect. 17.2 needs to be used for computing the support. All candidates in Ck+1 that meet the minimum support requirement are retained in Fk+1. The procedure is repeated iteratively until an empty set Fk +1 is generated. At this point, the algorithm terminates, and the set of frequent subgraphs in ∪ki=1 Fi is reported. Next, the two different ways of defining the size k of a graph, corresponding to node- and edge-based joins, will be described.