Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	335/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 331 332 333 334 335 336 337 338 ... 423

1-Data Mining tarjima

DASHED LINES INDICATE CORRESPONDENCE BETWEEN NODES DASHED LINES INDICATE CORRESPONDENCE BETWEEN NODES
METHANE MOLECULE 1 METHANE MOLECULE 2 METHANE MOLECULE 1 METHANE MOLECULE 2

O	1	H


								1
						2		C		1

			H	1		C				C	1	H
				1		1				2	1
			H			C				C		H
						2		C		1		H


N	C		CH3					1	1 _C		1	1
								N				C ¹ H

							1			2			1
H	O							H		O		H
(a) Acetaminophen					(b) Graph representation

Figure 17.1: A chemical compound (Acetaminophen) and its associated graph representation

symbol denoting a chemical element. Because of the presence of multiple atoms of the same element, such a graph will contain label repetitions. The repetition of labels within a single graph leads to numerous challenges in graph matching and distance computation.

Graph data are encountered in many real applications. Some examples of key applica-tions of graph data mining are as follows:

Chemical and biological data can be expressed as graphs in which each node corre-sponds to an atom and a bond between a pair of atoms is represented by an edge. The edges may be weighted to reflect bond strength. An example of a chemical compound and its corresponding graph are illustrated in Fig. 17.1. Figure 17.1a shows an illustra-tion of the chemical acetaminophen, a well-known analgesic. The corresponding graph representation is illustrated in Fig. 17.1b along with node labels and edge weights. In many graph mining applications, unit edge weights are assumed as a simplification.

XML data can be expressed as attributed graphs. The relationships between diﬀerent attributes of a structured record can be expressed as edges.

Virtually any data type can be expressed as an entity-relationship graph. This provides a diﬀerent way of mining conventional database records when they are expressed in the form of entity-relationship graphs.

Graph data are very powerful because of their ability to model arbitrary relationships between objects. The flexibility in graph representation comes at the price of greater com-putational complexity:

Graphs lack the “flat” structure of multidimensional or even contextual (e.g., time series) data. The latter is much easier to analyze with conventional models.

The repetition of labels among nodes leads to problems of isomorphism in computing similarity between graphs. This problem is NP-hard. This leads to computational challenges in similarity computation and graph matching.

17.2. MATCHING AND DISTANCE COMPUTATION IN GRAPHS

559

The second issue is of considerable importance, because both matching and distance com-putation are fundamental subproblems in graph mining applications. For example, in a fre-quent subgraph mining application, an important subproblem is that of subgraph matching.

This chapter is organized as follows. Section 17.2 addresses the problem of matching and distance computation in graphs. Graph transformation methods for distance compu-tation are discussed in Sect. 17.3. An important part of this section is the preprocessing methodologies, such as topological descriptors and kernel methods, that are often used for distance computation. Section 17.4 addresses the problem of pattern mining in graphs. The problem of clustering graphs is addressed in Sect. 17.5. Graph classification is addressed in Sect. 17.6. A summary is provided in Sect. 17.7.

17.2 Matching and Distance Computation in Graphs

The problems of matching and distance computation are closely related in the graph domain. Two graphs are said to match when a one-to-one correspondence can be established between the nodes of the two graphs, such that their labels match, and the edge presence between corresponding nodes match. The distance between such a pair of graphs is zero. Therefore, the problem of distance computation between a pair of graphs is at least as hard as that of graph matching. Matching graphs are also said to be isomorphic.

It should be pointed out that the term “matching” is used in two distinct contexts for graph mining, which can sometimes be confusing. For example, pairing up nodes in a single graph with the use of edges is also referred to as matching. Throughout this chapter, unless otherwise specified, our focus is not on the node matching problem, but the pairwise graph matching problem. This problem is also referred to as that of graph isomorphism.

Definition 17.2.1 (Graph Matching and Isomorphism) Two graphs G₁ = (N₁, A₁) and G₂ = (N₂, A₂) are isomorphic if and only if a one-to-one correspondence can be found between the nodes of N₁ and N₂ satisfying the following properties:

For each pair of corresponding nodes i ∈ N₁ and j ∈ N₂, their labels are the same. l(i) = l(j)

Let [i₁, i₂] be a node-pair in G₁ and [j₁, j₂] be the corresponding node-pair in G₂. Then the edge (i₁, i₂) exists in G₁ if and only if the edge (j₂, j₂) exists in G₂.

The computational challenges in graph matching arise because of the repetition in node labels. For example, consider two methane molecules, illustrated in Fig. 17.2. While the unique carbon atom in the two molecules can be matched in exactly one way, the hydrogen atoms can be matched up in 4! = 24 diﬀerent ways. Two possible matchings are illustrated in Figs. 17.2a and b, respectively. In general, greater the level of label repetition in each graph is, larger the number of possible matchings will be. The number of possible matchings between a pair of graphs increases exponentially with the size of the matched graphs. For a pair of graphs containing n nodes each, the number of possible matchings can be as large as n!. This makes the problem of matching a pair of graphs computationally very expensive.

Lemma 17.2.1 The problem of determining whether a matching exists between a pair of graphs, is NP-hard.

560

CHAPTER 17.

MINING GRAPH DATA

DASHED LINES INDICATE

CORRESPONDENCE BETWEEN NODES

DASHED LINES INDICATE

CORRESPONDENCE BETWEEN NODES

H
H
H
H

H

H
H
H

C

C
C

C

H

H
H
H

H

H

H

H

METHANE MOLECULE 1
METHANE MOLECULE 2
METHANE MOLECULE 1
METHANE MOLECULE 2

(a)

(b)

Figure 17.2: Two possible matchings between a pair of graphs representing methane molecules

The bibliographic notes contain pointers to the proof of NP-hardness. When the graphs are very large, exact matches often do not exist. However, approximate matches may exist. The level of approximation is quantified with the use of a distance function. Therefore, distance function computation between graphs is a more general problem than that of graph matching, and it is at least as diﬃcult. This issue will be discussed in detail in the next section.

Another related problem is that of subgraph matching. Unlike the problem of exact graph matching, the query graph needs to be explicitly distinguished from the data graph in this case.

Definition 17.2.2 (Node-Induced Subgraph) A node-induced subgraph of a graph G = (N, A) is a graph G_s = (N_s, A_s) satisfying the following properties:

N_s ⊆ N

A_s = A ∩ (N_s × N_s)

In other words, all the edges in the original graph G between nodes in the subset N_s ⊆ N are included in the subgraph G_s.

A subgraph isomorphism can be defined in terms of the node-induced subgraphs. A query graph G_q is a subgraph isomorphism of a data graph G, when it is an exact isomorphism of a node-induced subgraph of G.

Definition 17.2.3 (Subgraph Matching and Isomorphism) A query graph G_q = (N_q, A_q) is a subgraph isomorphism of the data graph G = (N, A) if and only if the fol-lowing conditions are satisfied:

Each node in N_q should be matched to a unique node with the same label in N , but each node in N may not necessarily be matched. For each node i ∈ N_q, there must exist a unique matching node j ∈ N , such that their labels are the same.

l(i) = l(j)

Let [i₁, i₂] be a node-pair in G_q , and let [j₁, j₂] be the corresponding node-pair in G, based on the matching discussed above. Then, the edge (i₁, i₂) exists in G_q if and only if the edge (j₁, j₂) exists in G.

17.2. MATCHING AND DISTANCE COMPUTATION IN GRAPHS								561

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 331 332 333 334 335 336 337 338 ... 423