Data Mining: The Textbook



  1. Maximum common subgraph distance: When two graphs contain a large subgraph in common, they are generally considered more similar. The maximum common subgraph problem and the related distance functions are addressed in Sect. 17.2 of Chap. 17.




  2. Substructure-based similarity: Although it is difficult to match two large graphs, it is much easier to match smaller substructures. The core idea is to count the frequently occurring substructures common to the two graphs and report the count as a similarity measure. This can be considered the graph analog of subsequence-based similarity in strings. Substructure-based similarity measures are discussed in detail in Sect. 17.3 of Chap. 17.




  3. Graph-edit distance: This distance measure is analogous to the string-edit distance and is defined as the number of edits required to transform one graph to the other. Because graph matching is a hard problem, this measure is difficult to implement for large graphs. The graph-edit distance is discussed in detail in Sect. 17.2.3.2 of Chap. 17.




  4. Graph kernels: Numerous kernel functions have been defined to measure similarity between graphs, such as the shortest-path kernel and the random-walk kernel. This topic is discussed in detail in Sect. 17.3.3 of Chap. 17.

These methods are quite complex and require a deeper background in the area of graphs. Therefore, the discussion of these measures is deferred to Chap. 17 of this book.
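As a toy illustration of the substructure idea, consider the simplest possible substructures: single edges. The sketch below compares two small graphs by the fraction of edge substructures they share (a Jaccard score over edge sets); this is only a minimal stand-in for the frequent-subgraph mining of Chap. 17, and the graphs used are made up for illustration.

```python
# Minimal sketch of substructure-based graph similarity, where the
# "substructures" counted are single edges. Real substructure-based
# measures mine larger frequent subgraphs (see Chap. 17).

def edge_set(edges):
    # Normalize undirected edges so that (u, v) and (v, u) match.
    return {tuple(sorted(e)) for e in edges}

def substructure_similarity(edges1, edges2):
    """Jaccard similarity over the shared edge substructures."""
    s1, s2 = edge_set(edges1), edge_set(edges2)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

# Two small example graphs: a triangle, and a triangle plus one pendant edge.
g1 = [(1, 2), (2, 3), (3, 1)]
g2 = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(substructure_similarity(g1, g2))  # 3 shared edges / 4 total = 0.75
```

Counting richer substructures (paths, frequent subgraphs) follows the same pattern: enumerate substructures on each side, then compare the resulting multisets.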


3.6 Supervised Similarity Functions


The previous sections discussed similarity measures that do not require any understanding of user intentions. In practice, the relevance of a feature or the choice of distance function heavily depends on the domain at hand. For example, for an image data set, should the color feature or the texture feature be weighted more heavily? These aspects cannot be modeled by a distance function without taking the user intentions into account. Unsupervised measures, such as the Lp-norm, treat all features equally, and have little intrinsic understanding of the end user’s semantic notion of similarity. The only way to incorporate this information into the similarity function is to use explicit feedback about the similarity and dissimilarity of objects. For example, the feedback can be expressed as the following sets of object pairs:



  • S = {(Oi, Oj) : Oi is similar to Oj}
  • D = {(Oi, Oj) : Oi is dissimilar to Oj}

How can this information be leveraged to improve the computation of similarity? Many specialized methods have been designed for supervised similarity computation. A common approach is to assume a specific closed form of the similarity function for which the parameters need to be learned. An example is the weighted Lp-norm in Sect. 3.2.1.1, where the parameters represented by Θ correspond to the feature weights (a1 . . . ad). Therefore, the first step is to create a distance function f(Oi, Oj, Θ), where Θ is a set of unknown weights. Assume that higher values of the function indicate greater dissimilarity. Therefore, this is a distance function, rather than a similarity function. Then, it is desirable to determine the
parameters Θ, so that the following conditions are satisfied as closely as possible:

f(Oi, Oj, Θ) ≈ 0   for all (Oi, Oj) ∈ S
f(Oi, Oj, Θ) ≈ 1   for all (Oi, Oj) ∈ D

In other words, the learned distance should be as small as possible on the user-labeled similar pairs and large on the dissimilar pairs, for example by minimizing the squared error of these conditions over both sets.

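If f is taken to be a weighted squared Euclidean distance, it is linear in the weights, so the conditions above can be fit by least squares. The following is a minimal sketch in plain Python; the toy pairs, the target value of 1 for dissimilar pairs, the learning rate, and the non-negativity clipping of the weights are illustrative assumptions, not a prescription from the text.

```python
# Hedged sketch: learn feature weights (a_1 ... a_d) of a weighted
# squared distance f(Oi, Oj, Theta) = sum_k a_k * (x_ik - x_jk)^2,
# so that f is near 0 on similar pairs S and near 1 on dissimilar
# pairs D. Plain gradient descent on the squared error; the clipping
# step keeps every weight non-negative (an illustrative choice).

def fit_weights(similar, dissimilar, d, lr=0.05, steps=2000):
    a = [1.0] * d                        # start from the unweighted norm
    pairs = [(x, y, 0.0) for x, y in similar] + \
            [(x, y, 1.0) for x, y in dissimilar]
    for _ in range(steps):
        grad = [0.0] * d
        for x, y, target in pairs:
            z = [(x[k] - y[k]) ** 2 for k in range(d)]      # squared gaps
            err = sum(a[k] * z[k] for k in range(d)) - target
            for k in range(d):
                grad[k] += 2.0 * err * z[k]
        a = [max(0.0, a[k] - lr * grad[k]) for k in range(d)]
    return a

# Toy feedback: feature 0 carries the user's notion of similarity,
# feature 1 is noise, so its learned weight should shrink toward 0.
S_pairs = [((0.0, 0.0), (0.1, 0.9)), ((1.0, 0.2), (1.1, 0.8))]
D_pairs = [((0.0, 0.1), (1.0, 0.2)), ((0.2, 0.9), (1.2, 0.1))]
weights = fit_weights(S_pairs, D_pairs, d=2)
# The weight on feature 0 stays near 1, while the noise feature's
# weight is driven to (almost) 0.
```

The same template extends to other closed forms of f(Oi, Oj, Θ): choose a parameterization, write the S and D conditions as targets, and fit Θ by minimizing the resulting error.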