Data Mining: The Textbook


[Figure 17.10 panels: a database of phenolic acids (salicylic acid, 3-hydroxybenzoic acid, 4-hydroxybenzoic acid) and the frequent substructures of phenolic acids (carboxyl group, phenol group)]

Figure 17.10: Examples of frequent substructures in a database of phenolic acids

either node extensions or edge extensions. Subsequently, the precise changes required to enable these two specific variations will be discussed.

The overall algorithm for frequent subgraph mining is illustrated in Fig. 17.11. The input to the algorithm is the graph database G = {G_1, ..., G_n} and a minimum support value minsup. The basic algorithm structure is similar to that of the Apriori algorithm, discussed in Fig. 4.2 of Chap. 4. A levelwise algorithm is used, in which candidate subgraphs C_{k+1} of size (k + 1) are generated by joining pairs of graphs from the set of frequent subgraphs F_k of size k. As discussed earlier, the size of a subgraph may refer to either its number of nodes or its number of edges, depending on the specific algorithm used. Two graphs can be joined only if they share a common subgraph of size (k − 1); the resulting candidate subgraph is of size (k + 1). Therefore, an important step of join processing is determining whether two graphs share a subgraph of size (k − 1). The matching algorithms discussed in Sect. 17.2 can be used for this purpose. In applications where node labels are distinct and isomorphism is not an issue, this step can be performed very efficiently. On the other hand, for large graphs with many repeating node labels, this step is slow because many candidate node matchings must be examined to test for isomorphism.
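The join condition can be made concrete for the easy case mentioned above, in which node labels are distinct. The following is a minimal sketch under stated assumptions, not the book's implementation: size is measured in edges, every node is identified by its unique label, a graph is a frozenset of edges, and the helper names `edge`, `is_connected`, and `join_candidates` are hypothetical. Under these assumptions, "sharing a subgraph of size (k − 1)" reduces to a set intersection of size k − 1.

```python
from itertools import combinations


def edge(u, v):
    """Undirected edge between two uniquely labeled nodes, in canonical order."""
    return tuple(sorted((u, v)))


def is_connected(edges):
    """Return True if the edges form a single connected graph (False if empty)."""
    edges = list(edges)
    if not edges:
        return False
    nodes = {n for e in edges for n in e}
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
            elif b == node and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen == nodes


def join_candidates(frequent_k):
    """Join pairs of k-edge graphs that share k-1 edges into (k+1)-edge candidates.

    Assumes every graph in frequent_k has the same number of edges k.
    """
    candidates = set()
    for g1, g2 in combinations(frequent_k, 2):
        if len(g1 & g2) == len(g1) - 1:      # common subgraph of size k-1
            merged = g1 | g2                 # candidate of size k+1
            if is_connected(merged):         # keep only connected candidates
                candidates.add(merged)
    return candidates


# Hypothetical frequent 2-edge subgraphs over node labels A, B, C, D.
f2 = {
    frozenset({edge("A", "B"), edge("B", "C")}),
    frozenset({edge("B", "C"), edge("C", "D")}),
    frozenset({edge("A", "B"), edge("B", "D")}),
}
for candidate in join_candidates(f2):
    print(sorted(candidate))
```

With repeated node labels, the intersection test is no longer valid, and the matching algorithms of Sect. 17.2 are needed instead.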

After the pairs of matching graphs have been identified, joins are performed on them to generate the candidates C_{k+1} of size (k + 1). The different node-based and edge-based variations in the methods for performing joins will be described later. Furthermore, the Apriori pruning trick is used: any candidate in C_{k+1} that has a k-subgraph not present in F_k is pruned. For each remaining candidate subgraph, the support is computed with respect to the graph database G. The subgraph isomorphism algorithm discussed in Sect. 17.2 needs to be used for computing the support. All candidates in C_{k+1} that meet the minimum support requirement are retained in F_{k+1}. The procedure is repeated iteratively until an empty set F_{k+1} is generated. At this point, the algorithm terminates, and the set of frequent subgraphs ∪_{i=1}^{k} F_i is reported. Next, the two different ways of defining the size k of a graph, corresponding to node- and edge-based joins, will be described.
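The pruning and support-counting steps can be summarized in symbols. The notation below is an assumption introduced for illustration: S ⊑ G_i denotes that S is subgraph-isomorphic to G_i, and sup(S) is taken to be an absolute count of graphs (if minsup is specified as a fraction, sup(S)/n would be compared instead).

```latex
\begin{align*}
\mathrm{sup}(S) &= \bigl|\{\, G_i \in \mathcal{G} : S \sqsubseteq G_i \,\}\bigr|
  && \text{(support of a subgraph } S\text{)} \\
S' \sqsubseteq S &\;\Longrightarrow\; \mathrm{sup}(S') \ge \mathrm{sup}(S)
  && \text{(downward closure, used for pruning)} \\
F_{k+1} &= \{\, S \in C_{k+1} : \mathrm{sup}(S) \ge \mathit{minsup} \,\}
  && \text{(support counting)} \\
\text{output} &= \bigcup_{i=1}^{k} F_i
  && \text{(reported once } F_{k+1} = \emptyset\text{)}
\end{align*}
```

The second line is what justifies pruning: a candidate with an infrequent k-subgraph cannot itself be frequent, so its support never needs to be counted.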

17.4. FREQUENT SUBSTRUCTURE MINING IN GRAPHS


Algorithm GraphApriori(Graph Database: G, Minimum Support: minsup);
begin
  F_1 = { All frequent singleton graphs };
  k = 1;
  while F_k is not empty do begin
    Generate C_{k+1} by joining pairs of graphs in F_k that share a subgraph of size (k − 1) in common;
    Prune subgraphs from C_{k+1} that violate downward closure;
    Determine F_{k+1} by support counting on (C_{k+1}, G) and retaining subgraphs from C_{k+1} with support at least minsup;
    k = k + 1;
  end;
  return(∪_{i=1}^{k} F_i);
end

Figure 17.11: The basic frequent subgraph discovery algorithm is related to the Apriori algorithm. The reader is encouraged to compare this pseudocode with the Apriori algorithm described in Fig. 4.2 of Chap. 4.
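The levelwise loop of Fig. 17.11 can be sketched end to end for the special case of distinct node labels. This is a simplified illustration under stated assumptions rather than the general algorithm: size is counted in edges, each graph is a frozenset of edges written as sorted label pairs, containment is tested by subset comparison instead of subgraph isomorphism, and minsup is treated as an absolute count. The names `is_connected`, `support`, and `graph_apriori`, and the toy database, are hypothetical.

```python
from itertools import combinations


def is_connected(edges):
    """Return True if the edges form a single connected graph (False if empty)."""
    edges = list(edges)
    if not edges:
        return False
    nodes = {n for e in edges for n in e}
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
            elif b == node and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen == nodes


def support(subgraph, database):
    """Number of database graphs that contain every edge of the subgraph."""
    return sum(1 for g in database if subgraph <= g)


def graph_apriori(database, minsup):
    """Return all frequent connected subgraphs (as frozensets of edges)."""
    # F_1: frequent single-edge graphs.
    singletons = {frozenset([e]) for g in database for e in g}
    frequent_k = {s for s in singletons if support(s, database) >= minsup}
    all_frequent = set()
    while frequent_k:
        all_frequent |= frequent_k
        # Join: pairs sharing k-1 edges yield connected (k+1)-edge candidates.
        candidates = {a | b for a, b in combinations(frequent_k, 2)
                      if len(a & b) == len(a) - 1 and is_connected(a | b)}
        # Prune: every connected k-edge subgraph must itself be frequent.
        candidates = {c for c in candidates
                      if all((c - {e}) in frequent_k
                             for e in c if is_connected(c - {e}))}
        # Support counting against the database.
        frequent_k = {c for c in candidates if support(c, database) >= minsup}
    return all_frequent


# Toy database of three graphs over node labels A, B, C, D.
database = [
    frozenset({("A", "B"), ("B", "C"), ("C", "D")}),
    frozenset({("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}),
    frozenset({("A", "B"), ("B", "C"), ("B", "D")}),
]
for sub in sorted(graph_apriori(database, minsup=2), key=lambda s: (len(s), sorted(s))):
    print(len(sub), sorted(sub))
```

For graphs with repeated node labels, such as the phenolic acids of Fig. 17.10, the subset tests in the join and in `support` would have to be replaced by the matching and subgraph isomorphism algorithms of Sect. 17.2.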
