Figure 17.9: Example of the product graph
Here, I(·) is the indicator function that takes the value 1 when the two sequences are identical and 0 otherwise. Then, the overall kernel similarity K(G1, G2) is defined as the probability-weighted sum of the primitive sequence kernels over all possible pairs of walks:
K(G_1, G_2) = \sum_{s_1, s_2} p(s_1 | G_1) \cdot p(s_2 | G_2) \cdot k(s_1, s_2) \qquad (17.10)
Here, p(si|Gi) is the probability of the random walk sequence si in the graph Gi . Note that this kernel similarity value will be higher when the same label sequences are used by the two graphs. A key challenge is to compute these probabilities because there are an exponential number of walks of a specific length, and the length of a walk may be any value in the range (1, ∞).
The random walk kernel is computed using the notion of a product graph GX between G1 and G2. The product graph is constructed by defining a vertex [u1, u2] for each pair of label-matching vertices u1 and u2 in the graphs G1 and G2, respectively. An edge is added between a pair of vertices [u1, u2] and [v1, v2] in the product graph GX if and only if an edge exists between the corresponding nodes in both of the individual graphs G1 and G2. In other words, the edge (u1, v1) must exist in G1 and the edge (u2, v2) must exist in G2. An example of a product graph is illustrated in Fig. 17.9. Note that each walk in the product graph corresponds to a pair of label-matching vertex sequences in the two graphs G1 and G2. If A is the binary adjacency matrix of the product graph, then the entries of A^k provide the number of walks of length k between the different pairs of vertices. Therefore, the total weighted number of walks may be computed as follows:
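The product graph construction described above can be sketched in a few lines. The sketch below is illustrative only (the graph representation and function names are our own, not from any standard library): a graph is given as a node-to-label dictionary and a set of undirected edges, and the product graph keeps exactly the label-matching vertex pairs, with an edge whenever the corresponding edge is present in both input graphs.

```python
# Illustrative sketch: constructing the product graph G_X of two labeled graphs.
# A graph is a node->label dict plus a set of undirected edges (node pairs).

def product_graph(labels1, edges1, labels2, edges2):
    # Vertices of G_X: all pairs (u1, u2) whose labels match.
    vertices = [(u1, u2) for u1 in labels1 for u2 in labels2
                if labels1[u1] == labels2[u2]]

    def has_edge(edges, a, b):
        # Edges are stored undirected, so check both orientations.
        return (a, b) in edges or (b, a) in edges

    # Edge between (u1, u2) and (v1, v2) iff (u1, v1) is in G1 AND (u2, v2) is in G2.
    prod_edges = set()
    for (u1, u2) in vertices:
        for (v1, v2) in vertices:
            # The '<' test (on comparable node ids) avoids duplicate undirected edges.
            if (u1, u2) < (v1, v2) and \
               has_edge(edges1, u1, v1) and has_edge(edges2, u2, v2):
                prod_edges.add(((u1, u2), (v1, v2)))
    return vertices, prod_edges

# Tiny example: a labeled triangle and a labeled single edge.
labels1 = {1: 'A', 2: 'B', 3: 'B'}
edges1 = {(1, 2), (2, 3), (1, 3)}
labels2 = {1: 'A', 2: 'B'}
edges2 = {(1, 2)}
V, E = product_graph(labels1, edges1, labels2, edges2)
```

In this example the product graph has three vertices, (1, 1), (2, 2), and (3, 2), because only those pairs agree on their labels; each walk through it simultaneously traces a label-matching walk in each of the two input graphs.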
K(G_1, G_2) = \sum_{i,j} \sum_{k=1}^{\infty} \lambda^k [A^k]_{ij} = e^T (I - \lambda A)^{-1} e \qquad (17.11)
Here, e is an |GX |-dimensional column vector of 1s, and λ ∈ (0, 1) is a discount factor. The discount factor λ should always be smaller than the inverse of the largest eigenvalue of A to ensure convergence of the infinite summation. Another variant of the random walk kernel is as follows:
K(G_1, G_2) = \sum_{i,j} \sum_{k=1}^{\infty} \frac{\lambda^k}{k!} [A^k]_{ij} = e^T \exp(\lambda A) \, e \qquad (17.12)
When the graphs in a collection are widely varying in size, the kernel functions of Eqs. 17.11 and 17.12 should be further normalized by dividing with |G1| · |G2|. Alternatively, in some
probabilistic versions of the random walk kernel, the vectors e^T and e are replaced with starting and stopping probabilities of the random walk over various nodes in the product graph. This computation is quite expensive, and may require as much as O(n^6) time.
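Rather than inverting (I − λA), the geometric random walk kernel of Eq. 17.11 can also be approximated by simply truncating the infinite sum, since the λ^k terms decay quickly. The pure-Python sketch below (function names are our own) assumes A is the adjacency matrix of the product graph, given as a list of lists, and that λ is smaller than the inverse of the largest eigenvalue of A, so the truncated sum is close to its limit.

```python
# Sketch: geometric random walk kernel of Eq. 17.11, approximated by
# truncating the infinite sum at k_max. Pure Python, no external libraries.

def matmul(X, Y):
    # Naive dense matrix multiplication for square matrices.
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def random_walk_kernel(A, lam=0.1, k_max=20):
    # Requires lam < 1 / (largest eigenvalue of A) for the series to converge.
    Ak = [row[:] for row in A]          # current power A^k, starting at A^1
    total, weight = 0.0, lam            # weight holds lam^k
    for _ in range(k_max):
        # Add lam^k * sum_{i,j} [A^k]_{ij} for the current k.
        total += weight * sum(sum(row) for row in Ak)
        Ak = matmul(Ak, A)
        weight *= lam
    return total

# Product graph that is a path on 3 nodes (largest eigenvalue sqrt(2), so
# lam = 0.1 is safely inside the convergence region).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
val = random_walk_kernel(A)
```

For large product graphs the closed form e^T (I − λA)^{−1} e is preferable, but the truncated sum makes the walk-counting interpretation of the kernel explicit: each term weights the number of length-k walks by λ^k.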
17.3.3.2 Shortest-Path Kernels
In the shortest-path kernel, a primitive kernel ks(i1, j1, i2, j2) is defined on node pairs [i1, j1] ∈ G1 and [i2, j2] ∈ G2. There are several ways of defining the kernel function ks(i1, j1, i2, j2). A simple way of defining the kernel value is to set it to 1 when the shortest-path distances satisfy d(i1, j1) = d(i2, j2), and to 0 otherwise.
Then, the overall kernel similarity is equal to the sum of all primitive kernels over
different quadruplets of nodes:
K(G_1, G_2) = \sum_{i_1, j_1, i_2, j_2} k_s(i_1, j_1, i_2, j_2) \qquad (17.13)
The shortest-path kernel may be computed by applying the all-pairs shortest-path algorithm on each of the graphs. It can be shown that the complexity of the kernel computation is O(n^4). Although this is still quite expensive, it may be practical for small graphs, such as chemical compounds.
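A minimal sketch of this computation, assuming unweighted and undirected graphs (all names below are illustrative, not from a standard library): Floyd-Warshall yields the all-pairs shortest-path distances for each graph, and the delta version of the primitive kernel then counts the quadruplets whose finite distances agree.

```python
# Sketch: shortest-path kernel via Floyd-Warshall on each graph, then the
# delta primitive kernel of Eq. 17.13. Graph = node count + undirected edges.

INF = float('inf')

def all_pairs_shortest_paths(n, edges):
    # Floyd-Warshall on an unweighted, undirected graph with nodes 0..n-1.
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        d[u][v] = d[v][u] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def shortest_path_kernel(n1, edges1, n2, edges2):
    d1 = all_pairs_shortest_paths(n1, edges1)
    d2 = all_pairs_shortest_paths(n2, edges2)
    # Delta kernel: count quadruplets with equal, finite shortest-path distance.
    return sum(1 for i1 in range(n1) for j1 in range(n1)
                 for i2 in range(n2) for j2 in range(n2)
                 if d1[i1][j1] == d2[i2][j2] < INF)

# Two path graphs: 0-1-2 and 0-1.
val = shortest_path_kernel(3, [(0, 1), (1, 2)], 2, [(0, 1)])
```

The quadruple loop makes the O(n^4) cost of the counting step visible; the two Floyd-Warshall runs cost O(n^3) each and are therefore not the bottleneck.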
17.4 Frequent Substructure Mining in Graphs
Frequent subgraph mining is a fundamental building block for graph mining algorithms. Many of the clustering, classification, and similarity search techniques use frequent substructure mining as an intermediate step. This is because frequent substructures encode important properties of graphs in many application domains. For example, consider the series of phenolic acids illustrated in Fig. 17.10. These represent a family of organic compounds with similar chemical properties. Many complex variations of this family act as signaling molecules and agents of defense in plants. The properties of phenolic acids are a direct result of the presence of two frequent substructures, corresponding to the carboxyl group and phenol group, respectively. These groups are illustrated in Fig. 17.10 as well. The relevance of such substructural properties is not restricted to the chemical domain. This is the reason that frequent substructures are often used in the intermediate stages of many graph mining applications such as clustering and classification.
The definition of a frequent subgraph is identical to the case of association pattern mining, except that a subgraph relationship is used to count the support rather than a subset relationship. Many well-known frequent substructure mining algorithms are based on the enumeration tree principle discussed in Chap. 4. The simplest of these methods is based on the Apriori algorithm. This algorithm is discussed in detail in Fig. 4.2 of Chap. 4. The Apriori algorithm uses joins to create candidate patterns of size (k + 1) from frequent patterns of size k. However, because of the greater complexity of graph-structured data, the join between a pair of graphs may not result in a unique solution. For example, candidate frequent patterns can be generated by either node extensions or edge extensions. Thus, the main difference between these two variations is in terms of how frequent substructures of size k are defined and joined together to create candidate structures of size (k + 1). The “size”
of a subgraph may refer to either the number of nodes in it, or the number of edges in it depending on whether node extensions or edge extensions are used. Therefore, the following will describe the Apriori-based algorithm in a general way without specifically discussing