Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə342/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   338   339   340   341   342   343   344   345   ...   423
1-Data Mining tarjima

HO

H O







HO


















SALICYLIC ACID
3 HYDROXYBENZOIC ACID
4 HYDROXYBENZOIC ACID



DATABASE OF PHENOLIC ACIDS


HO



CARBOXYL GROUP PHENOL GROUP


FREQUENT SUBSTRUCTURES OF PHENOLIC ACIDS

Figure 17.10: Examples of frequent substructures in a database of phenolic acids


either node extensions or edge extensions. Subsequently, the precise changes required to enable these two specific variations will be discussed.

The overall algorithm for frequent subgraph mining is illustrated in Fig. 17.11. The input to the algorithm is the graph database G = {G1 . . . Gn} and a minimum support value minsup. The basic algorithm structure is similar to that of the Apriori algorithm, discussed in Fig. 4.2 of Chap. 4. A levelwise algorithm is used, in which candidate subgraphs Ck+1 of size (k + 1) are generated by using joins on graph pairs from the set of frequent subgraphs Fk of size k. As discussed earlier, the size of a subgraph may refer to either its nodes or edges, depending on the specific algorithm used. The two graphs need to be matching in a subgraph of size (k − 1) for a join to be successfully performed. The resulting candidate subgraph will be of size ( k + 1). Therefore, one of the important steps of join processing, is determining whether two graphs share a subgraph of size (k − 1) in common. The matching algorithms discussed in Sect. 17.2 can be used for this purpose. In some applications, where node labels are distinct and isomorphism is not an issue, this step can be performed very efficiently. On the other hand, for large graphs that have many repeating node labels, this step is slow because of isomorphism.


After the pairs of matching graphs have been identified, joins are performed on them in order to generate the candidates Ck +1 of size (k + 1). The different node-based and edge-based variations in the methods for performing joins will be described later. Furthermore, the Apriori pruning trick is used. Candidates in Ck+1 that are such that any of their k-subgraphs do not exist in Fk are pruned. For each remaining candidate subgraph, the support is computed with respect to the graph database G. The subgraph isomorphism algorithm discussed in Sect. 17.2 needs to be used for computing the support. All candidates in Ck+1 that meet the minimum support requirement are retained in Fk+1. The procedure is repeated iteratively until an empty set Fk +1 is generated. At this point, the algorithm terminates, and the set of frequent subgraphs in ∪ki=1 Fi is reported. Next, the two different ways of defining the size k of a graph, corresponding to node- and edge-based joins, will be described.



17.4. FREQUENT SUBSTRUCTURE MINING IN GRAPHS

577

Algorithm GraphApriori(Graph Database: G,


Minimum Support: minsup);


begin
F1 = { All Frequent singleton graphs };


k = 1;
while Fk is not empty do begin

Generate Ck+1 by joining pairs of graphs in Fk that share a subgraph of size (k − 1) in common;


Prune subgraphs from Ck+1 that violate downward closure; Determine Fk+1 by support counting on (Ck+1, G) and retaining


subgraphs from Ck+1 with support at least minsup; k = k + 1;

end;
return(k F );


i=1 i
Figure 17.11: The basic frequent subgraph discovery algorithm is related to the Apriori algorithm. The reader is encouraged to compare this pseudocode with the Apriori algorithm described in Fig. 4.2 of Chap. 4.




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   338   339   340   341   342   343   344   345   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin