Data Mining: The Textbook




Figure 17.9: Example of the product graph

Here, I(·) is the indicator function that takes the value of 1 when the two sequences are the same and 0 otherwise. Then, the overall kernel similarity K(G₁, G₂) is defined as the probability-weighted sum of the primitive sequence kernels over all possible pairs of walks:

$$K(G_1, G_2) = \sum_{s_1, s_2} p(s_1 \mid G_1) \cdot p(s_2 \mid G_2) \cdot k(s_1, s_2) \tag{17.10}$$

Here, p(s_i|G_i) is the probability of the random walk sequence s_i in the graph G_i. Note that this kernel similarity value will be higher when the same label sequences are used by the two graphs. A key challenge is to compute these probabilities because there are an exponential number of walks of a specific length, and the length of a walk may be any value in the range (1, ∞).

The random walk kernel is computed using the notion of a product graph G_X between G₁ and G₂. The product graph is constructed by defining a vertex [u₁, u₂] for each pair of label-matching vertices u₁ and u₂ in the graphs G₁ and G₂, respectively. An edge is added between a pair of vertices [u₁, u₂] and [v₁, v₂] in the product graph G_X if and only if an edge exists between the corresponding nodes in both the individual graphs G₁ and G₂. In other words, the edge (u₁, v₁) must exist in G₁ and the edge (u₂, v₂) must exist in G₂. An example of a product graph is illustrated in Fig. 17.9. Note that each walk in the product graph corresponds to a pair of label-matching sequences of vertices in the two graphs G₁ and G₂. If A is the binary adjacency matrix of the product graph, then the entries of A^k provide the number of walks of length k between the different pairs of vertices. Therefore, the total weighted number of walks may be computed as follows:

$$K(G_1, G_2) = \sum_{i,j} \sum_{k=1}^{\infty} \lambda^k [A^k]_{ij} = e^T (I - \lambda A)^{-1} e \tag{17.11}$$

Here, e is a |G_X|-dimensional column vector of 1s, and λ ∈ (0, 1) is a discount factor. The discount factor λ should always be smaller than the inverse of the largest eigenvalue of A to ensure convergence of the infinite summation. Another variant of the random walk kernel is as follows:

$$K(G_1, G_2) = \sum_{i,j} \sum_{k=1}^{\infty} \frac{\lambda^k}{k!} [A^k]_{ij} = e^T \exp(\lambda A)\, e \tag{17.12}$$
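To make the product-graph construction described above concrete, here is a minimal pure-Python sketch. The input format (a dictionary of vertex labels plus a list of undirected edges) and the function names `product_graph` and `count_walks` are assumptions of this illustration, not part of the text, and the two small graphs are a made-up toy example rather than the graphs of Fig. 17.9.

```python
from itertools import product


def _both_directions(edges):
    """Return a set holding every undirected edge in both orientations."""
    edges = list(edges)
    return set(edges) | {(v, u) for u, v in edges}


def product_graph(labels1, edges1, labels2, edges2):
    """Build the product graph of two undirected, vertex-labeled graphs.

    labels1/labels2 map each vertex to its label; edges1/edges2 list the
    undirected edges (u, v). These input formats are assumptions.
    Returns the product vertices and the binary adjacency matrix A.
    """
    # One product vertex [u1, u2] per label-matching pair of vertices.
    vertices = [(u1, u2) for u1, u2 in product(labels1, labels2)
                if labels1[u1] == labels2[u2]]
    index = {v: i for i, v in enumerate(vertices)}
    e1, e2 = _both_directions(edges1), _both_directions(edges2)

    n = len(vertices)
    adj = [[0] * n for _ in range(n)]
    for (u1, u2), (v1, v2) in product(vertices, repeat=2):
        # Edge iff (u1, v1) is in G1 AND (u2, v2) is in G2.
        if (u1, v1) in e1 and (u2, v2) in e2:
            adj[index[(u1, u2)]][index[(v1, v2)]] = 1
    return vertices, adj


def count_walks(adj, k):
    """Return A^k; entry (i, j) is the number of walks of length k from i to j."""
    n = len(adj)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # A^0 = I
    for _ in range(k):
        result = [[sum(result[i][m] * adj[m][j] for m in range(n))
                   for j in range(n)] for i in range(n)]
    return result


# Hypothetical toy graphs: a labeled path A-B-A and a single edge A-B.
labels1, edges1 = {1: "A", 2: "B", 3: "A"}, [(1, 2), (2, 3)]
labels2, edges2 = {"1'": "A", "2'": "B"}, [("1'", "2'")]
vertices, adj = product_graph(labels1, edges1, labels2, edges2)
walks_of_length_2 = count_walks(adj, 2)
```

In this toy case the product graph has the three vertices [1, 1'], [2, 2'], and [3, 1'], connected as a path, so every walk in it pairs an A-B-A-... walk in the first graph with one carrying the same labels in the second.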

When the graphs in a collection vary widely in size, the kernel functions of Eqs. 17.11 and 17.12 should be further normalized by dividing by |G₁| · |G₂|. Alternatively, in some probabilistic versions of the random walk kernel, the vectors e^T and e are replaced with the starting and stopping probabilities of the random walk over the various nodes in the product graph. This computation is quite expensive, and may require as much as O(n^6) time.
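As an illustration of the two series in Eqs. 17.11 and 17.12, the following sketch evaluates their left-hand sides by truncating the sum over k. It assumes the product-graph adjacency matrix is given as a list of lists; the function names and the truncation parameter `max_len` are hypothetical. One caveat worth stating: the closed forms e^T (I − λA)⁻¹ e and e^T exp(λA) e also include the k = 0 term, so they exceed the sums starting at k = 1 by exactly |G_X|.

```python
def walk_totals(adj, max_len):
    """Yield e^T A^k e (the total number of walks of length k) for k = 1..max_len."""
    n = len(adj)
    vec = [1.0] * n  # A^0 e
    for _ in range(max_len):
        # One matrix-vector product turns A^(k-1) e into A^k e.
        vec = [sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        yield sum(vec)


def geometric_walk_kernel(adj, lam, max_len=50):
    """Truncated series of Eq. 17.11: sum over k >= 1 of lam^k * e^T A^k e.

    Converges only if lam is below the inverse of the largest eigenvalue of A.
    """
    return sum(lam ** k * total
               for k, total in enumerate(walk_totals(adj, max_len), start=1))


def exponential_walk_kernel(adj, lam, max_len=50):
    """Truncated series of Eq. 17.12: sum over k >= 1 of (lam^k / k!) * e^T A^k e."""
    total, weight = 0.0, 1.0
    for k, walks in enumerate(walk_totals(adj, max_len), start=1):
        weight *= lam / k  # running value of lam^k / k!
        total += weight * walks
    return total


# Hypothetical product graph: a path on three vertices (largest eigenvalue sqrt(2)).
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k_geo = geometric_walk_kernel(path3, lam=0.5, max_len=200)
k_exp = exponential_walk_kernel(path3, lam=0.5, max_len=200)
```

For this path, the series of Eq. 17.11 with λ = 0.5 converges to 7, whereas e^T (I − λA)⁻¹ e evaluates to 10; the difference of 3 is the number of product-graph vertices. Dividing either result by |G₁| · |G₂| gives the size-normalized variant mentioned above.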

17.3.3.2 Shortest-Path Kernels

In the shortest-path kernel, a primitive kernel k_s(i₁, i₂, j₁, j₂) is defined on node-pairs [i₁, j₁] ∈ G₁ and [i₂, j₂] ∈ G₂. There are several ways of defining the kernel function k_s(i₁, i₂, j₁, j₂). A simple way of defining the kernel value is to set it to 1 when the shortest-path distance d(i₁, j₁) in G₁ equals the shortest-path distance d(i₂, j₂) in G₂, and 0 otherwise.

Then, the overall kernel similarity is equal to the sum of all primitive kernels over different quadruplets of nodes:

$$K(G_1, G_2) = \sum_{i_1, i_2, j_1, j_2} k_s(i_1, i_2, j_1, j_2) \tag{17.13}$$

The shortest-path kernel may be computed by applying the all-pairs shortest-path algorithm on each of the graphs. It can be shown that the complexity of the kernel computation is O(n^4). Although this is still quite expensive, it may be practical for small graphs, such as chemical compounds.
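The computation just described can be sketched as follows, under several assumptions that the text does not fix: the graphs are undirected and unweighted, vertices are numbered 0..n−1, node-pairs are ordered pairs of distinct vertices within one graph, and unreachable pairs are ignored. The function names are hypothetical. The Floyd–Warshall algorithm is used here as one possible all-pairs shortest-path method.

```python
from collections import Counter

INF = float("inf")


def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall distances for an undirected, unweighted graph on 0..n-1."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


def distance_histogram(dist):
    """Count ordered pairs (i, j), i != j, of connected vertices by distance."""
    return Counter(d for i, row in enumerate(dist)
                   for j, d in enumerate(row) if i != j and d != INF)


def shortest_path_kernel(n1, edges1, n2, edges2):
    """Sum the 0/1 primitive kernel over all quadruplets (Eq. 17.13).

    Rather than looping over every quadruplet, group node-pairs by distance:
    a quadruplet contributes 1 exactly when both pairs have the same distance.
    """
    h1 = distance_histogram(all_pairs_shortest_paths(n1, edges1))
    h2 = distance_histogram(all_pairs_shortest_paths(n2, edges2))
    return sum(count * h2[d] for d, count in h1.items())


# Hypothetical toy graphs: a path on three vertices and a triangle.
path_edges = [(0, 1), (1, 2)]
triangle_edges = [(0, 1), (1, 2), (0, 2)]
k_path_triangle = shortest_path_kernel(3, path_edges, 3, triangle_edges)
k_path_path = shortest_path_kernel(3, path_edges, 3, path_edges)
```

For this particular equality-based primitive kernel, the histogram grouping returns the same value as the explicit loop over quadruplets, so the explicit O(n^4) enumeration is only needed for more general primitive kernels.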

17.4 Frequent Substructure Mining in Graphs

Frequent subgraph mining is a fundamental building block for graph mining algorithms. Many of the clustering, classification, and similarity search techniques use frequent substructure mining as an intermediate step. This is because frequent substructures encode important properties of graphs in many application domains. For example, consider the series of phenolic acids illustrated in Fig. 17.10. These represent a family of organic compounds with similar chemical properties. Many complex variations of this family act as signaling molecules and agents of defense in plants. The properties of phenolic acids are a direct result of the presence of two frequent substructures, corresponding to the carboxyl group and phenol group, respectively. These groups are illustrated in Fig. 17.10 as well. The relevance of such substructural properties is not restricted to the chemical domain. This is the reason that frequent substructures are often used in the intermediate stages of many graph mining applications such as clustering and classification.

The definition of a frequent subgraph is identical to the case of association pattern mining, except that a subgraph relationship is used to count the support rather than a subset relationship. Many well-known frequent substructure mining algorithms are based on the enumeration tree principle discussed in Chap. 4. The simplest of these methods is based on the Apriori algorithm, which is described in detail in Fig. 4.2 of Chap. 4. The Apriori algorithm uses joins to create candidate patterns of size (k + 1) from frequent patterns of size k. However, because of the greater complexity of graph-structured data, the join between a pair of graphs may not result in a unique solution. For example, candidate frequent patterns can be generated by either node extensions or edge extensions. Thus, the main difference between these two variations lies in how frequent substructures of size k are defined and joined together to create candidate structures of size (k + 1). The “size” of a subgraph may refer to either the number of nodes in it or the number of edges in it, depending on whether node extensions or edge extensions are used. Therefore, the following will describe the Apriori-based algorithm in a general way without specifically discussing
