Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə335/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   331   332   333   334   335   336   337   338   ...   423
1-Data Mining tarjima

O

1

H




















































1




























2

C




1











































H

1

C







C

1

H













1

1







2

1













H

C







C

H
















2

C




1




H





























































N

C

CH3







1

1 C

1

1










N

C 1 H





































1




2

1




H

O










H




O




H




(a) Acetaminophen

(b) Graph representation




Figure 17.1: A chemical compound (Acetaminophen) and its associated graph representation


symbol denoting a chemical element. Because of the presence of multiple atoms of the same element, such a graph will contain label repetitions. The repetition of labels within a single graph leads to numerous challenges in graph matching and distance computation.

Graph data are encountered in many real applications. Some examples of key applica-tions of graph data mining are as follows:





  • Chemical and biological data can be expressed as graphs in which each node corre-sponds to an atom and a bond between a pair of atoms is represented by an edge. The edges may be weighted to reflect bond strength. An example of a chemical compound and its corresponding graph are illustrated in Fig. 17.1. Figure 17.1a shows an illustra-tion of the chemical acetaminophen, a well-known analgesic. The corresponding graph representation is illustrated in Fig. 17.1b along with node labels and edge weights. In many graph mining applications, unit edge weights are assumed as a simplification.




  • XML data can be expressed as attributed graphs. The relationships between different attributes of a structured record can be expressed as edges.




  • Virtually any data type can be expressed as an entity-relationship graph. This provides a different way of mining conventional database records when they are expressed in the form of entity-relationship graphs.

Graph data are very powerful because of their ability to model arbitrary relationships between objects. The flexibility in graph representation comes at the price of greater com-putational complexity:





  1. Graphs lack the “flat” structure of multidimensional or even contextual (e.g., time series) data. The latter is much easier to analyze with conventional models.




  1. The repetition of labels among nodes leads to problems of isomorphism in computing similarity between graphs. This problem is NP-hard. This leads to computational challenges in similarity computation and graph matching.

17.2. MATCHING AND DISTANCE COMPUTATION IN GRAPHS

559

The second issue is of considerable importance, because both matching and distance com-putation are fundamental subproblems in graph mining applications. For example, in a fre-quent subgraph mining application, an important subproblem is that of subgraph matching.


This chapter is organized as follows. Section 17.2 addresses the problem of matching and distance computation in graphs. Graph transformation methods for distance compu-tation are discussed in Sect. 17.3. An important part of this section is the preprocessing methodologies, such as topological descriptors and kernel methods, that are often used for distance computation. Section 17.4 addresses the problem of pattern mining in graphs. The problem of clustering graphs is addressed in Sect. 17.5. Graph classification is addressed in Sect. 17.6. A summary is provided in Sect. 17.7.


17.2 Matching and Distance Computation in Graphs


The problems of matching and distance computation are closely related in the graph domain. Two graphs are said to match when a one-to-one correspondence can be established between the nodes of the two graphs, such that their labels match, and the edge presence between corresponding nodes match. The distance between such a pair of graphs is zero. Therefore, the problem of distance computation between a pair of graphs is at least as hard as that of graph matching. Matching graphs are also said to be isomorphic.


It should be pointed out that the term “matching” is used in two distinct contexts for graph mining, which can sometimes be confusing. For example, pairing up nodes in a single graph with the use of edges is also referred to as matching. Throughout this chapter, unless otherwise specified, our focus is not on the node matching problem, but the pairwise graph matching problem. This problem is also referred to as that of graph isomorphism.


Definition 17.2.1 (Graph Matching and Isomorphism) Two graphs G1 = (N1, A1) and G2 = (N2, A2) are isomorphic if and only if a one-to-one correspondence can be found between the nodes of N1 and N2 satisfying the following properties:





  1. For each pair of corresponding nodes i ∈ N1 and j ∈ N2, their labels are the same. l(i) = l(j)




  1. Let [i1, i2] be a node-pair in G1 and [j1, j2] be the corresponding node-pair in G2. Then the edge (i1, i2) exists in G1 if and only if the edge (j2, j2) exists in G2.

The computational challenges in graph matching arise because of the repetition in node labels. For example, consider two methane molecules, illustrated in Fig. 17.2. While the unique carbon atom in the two molecules can be matched in exactly one way, the hydrogen atoms can be matched up in 4! = 24 different ways. Two possible matchings are illustrated in Figs. 17.2a and b, respectively. In general, greater the level of label repetition in each graph is, larger the number of possible matchings will be. The number of possible matchings between a pair of graphs increases exponentially with the size of the matched graphs. For a pair of graphs containing n nodes each, the number of possible matchings can be as large as n!. This makes the problem of matching a pair of graphs computationally very expensive.


Lemma 17.2.1 The problem of determining whether a matching exists between a pair of graphs, is NP-hard.





560


CHAPTER 17.


MINING GRAPH DATA





DASHED LINES INDICATE


CORRESPONDENCE BETWEEN NODES


DASHED LINES INDICATE


CORRESPONDENCE BETWEEN NODES





H
H
H
H





H


H
H
H





C


C
C


C





H


H
H
H



H


H


H


H





METHANE MOLECULE 1
METHANE MOLECULE 2
METHANE MOLECULE 1
METHANE MOLECULE 2



(a)

(b)




Figure 17.2: Two possible matchings between a pair of graphs representing methane molecules


The bibliographic notes contain pointers to the proof of NP-hardness. When the graphs are very large, exact matches often do not exist. However, approximate matches may exist. The level of approximation is quantified with the use of a distance function. Therefore, distance function computation between graphs is a more general problem than that of graph matching, and it is at least as difficult. This issue will be discussed in detail in the next section.

Another related problem is that of subgraph matching. Unlike the problem of exact graph matching, the query graph needs to be explicitly distinguished from the data graph in this case.


Definition 17.2.2 (Node-Induced Subgraph) A node-induced subgraph of a graph G = (N, A) is a graph Gs = (Ns, As) satisfying the following properties:





  1. Ns ⊆ N




  1. As = A ∩ (Ns × Ns)



In other words, all the edges in the original graph G between nodes in the subset Ns ⊆ N are included in the subgraph Gs.

A subgraph isomorphism can be defined in terms of the node-induced subgraphs. A query graph Gq is a subgraph isomorphism of a data graph G, when it is an exact isomorphism of a node-induced subgraph of G.


Definition 17.2.3 (Subgraph Matching and Isomorphism) A query graph Gq = (Nq, Aq) is a subgraph isomorphism of the data graph G = (N, A) if and only if the fol-lowing conditions are satisfied:





  1. Each node in Nq should be matched to a unique node with the same label in N , but each node in N may not necessarily be matched. For each node i ∈ Nq, there must exist a unique matching node j ∈ N , such that their labels are the same.



l(i) = l(j)



  1. Let [i1, i2] be a node-pair in Gq , and let [j1, j2] be the corresponding node-pair in G, based on the matching discussed above. Then, the edge (i1, i2) exists in Gq if and only if the edge (j1, j2) exists in G.

17.2. MATCHING AND DISTANCE COMPUTATION IN GRAPHS

561





Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   331   332   333   334   335   336   337   338   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin