contextual attribute. For the case of discrete sequence data, many distance and similarity functions, such as the edit distance and the LCSS, are commonly used.
Because distance functions are often intended to model user notions of similarity, feedback should be used, where possible, to improve the distance function design. This feedback can be used within the context of a parameterized model to learn the optimal parameters that are consistent with the user-provided feedback.
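As a concrete illustration of the last point, the sketch below learns per-attribute weights of a weighted Euclidean distance from user-labeled similar/dissimilar pairs by gradient descent. It is a minimal sketch under assumed choices (the squared-loss objective, margin, learning rate, and synthetic feedback are all hypothetical), not a method prescribed in this chapter.

```python
import numpy as np

def weighted_dist(x, y, w):
    # Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2).
    return np.sqrt(np.sum(w * (x - y) ** 2))

def learn_weights(pairs, labels, dim, lr=0.01, epochs=100, margin=1.0):
    # pairs: list of (x, y) tuples; labels: 1 = "similar", 0 = "dissimilar"
    # according to user feedback. Squared distances of similar pairs are
    # pushed toward 0, those of dissimilar pairs toward the margin.
    w = np.ones(dim)
    for _ in range(epochs):
        for (x, y), s in zip(pairs, labels):
            gap = (x - y) ** 2               # per-attribute squared differences
            d2 = np.dot(w, gap)              # squared weighted distance
            target = 0.0 if s == 1 else margin
            w -= lr * 2.0 * (d2 - target) * gap  # gradient of (d2 - target)^2
            w = np.maximum(w, 0.0)           # weights must stay nonnegative
    return w

# Hypothetical feedback: the second attribute matters, the first does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
pairs = [(X[i], X[i + 1]) for i in range(0, 18, 2)]
labels = [1 if abs(a[1] - b[1]) < 0.5 else 0 for a, b in pairs]
print(learn_weights(pairs, labels, dim=2))
```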
3.8 Bibliographic Notes
The problem of similarity computation has been studied extensively by data mining researchers and practitioners in recent years. The issues with high-dimensional data were explored in [17, 88, 266]. In the work of [88], the impact of distance concentration effects on high-dimensional computation was analyzed. The work in [266] showed the relative advantages of picking distance functions that are locality sensitive. That work also showed the advantages of the Manhattan metric over the Euclidean metric. Fractional metrics were proposed in [17] and generally provide more accurate results than the Manhattan and Euclidean metrics. The ISOMAP method discussed in this chapter was proposed in [490]. Numerous local methods are also possible for distance function computation. An example of an effective local method is the instance-based method proposed in [543].
Similarity in categorical data was explored extensively in [104]. In this work, a number of similarity measures were analyzed and tested in the context of the outlier detection problem. The Goodall measure is introduced in [232]. The work in [122] uses information-theoretic measures for the computation of similarity. Most of the measures discussed in this chapter treat all mismatches on an attribute identically. However, a number of methods proposed in [74, 363, 473] distinguish between mismatches on the basis of the attribute values involved. The premise is that mismatches between infrequent attribute values are statistically indicative of greater dissimilarity than mismatches between frequent attribute values. Thus, in these methods, S(xi, yi) is not always set to 0 (or the same value) when xi and yi are different. A local similarity measure is presented in [182]. Text similarity measures have been studied extensively in the information retrieval literature [441].
The area of time-series similarity measures is a rich one, and a significant number of algorithms have been designed in this context. An excellent tutorial on the topic may be found in [241]. The use of wavelets for similarity computation in time series is discussed in [130]. While DTW has been used extensively in the context of speech recognition, its use in data mining applications was first proposed in [87]. Subsequently, it has been used extensively [526] for similarity-based applications in data mining. The major challenge in data mining applications is its computationally intensive nature. Numerous methods [307] have been proposed in the time series data mining literature to speed up DTW. A fast method for computing a lower bound on DTW was proposed in [308], where it was also shown how this bound can be used for exact indexing. A window-based approach for computing similarity in sequences with noise, scaling, and translation was proposed in [53]. Methods for similarity search in multivariate time series and sequences were proposed in [499, 500]. The edit distance has been used extensively in biological data for computing similarity between sequences [244]. The use of transformation rules for time-series similarity has been studied in [283, 432]. Such rules can be used to create edit distance-like measures for continuous time series. Methods for the string edit distance are proposed in [438]. It has been shown in [141] how the Lp-norm may be combined with the edit distance. Algorithms for the LCSS problem may be found in [77, 92, 270, 280]. A survey of these algorithms is available
in [92]. A variety of other measures for time series and sequence similarity are discussed in [32].
Numerous methods are available for similarity search in graphs. A variety of efficient shortest-path algorithms for finding distances between nodes may be found in [62]. The PageRank algorithm is discussed in the Web mining book [357]. The NP-hardness of the graph isomorphism problem and of other problems closely related to the edit distance is discussed in [221]. The relationship between the maximum common subgraph problem and the graph-edit distance problem has been studied in [119, 120]. The problem of substructure similarity search and the use of substructures for similarity search have been addressed in [520, 521]. A notion of mutation distance has been proposed in [522] to measure the distances between graphs. A method that uses the frequent substructures of a graph for similarity computation in clustering is proposed in [42]. A survey on graph-matching techniques may be found in [26].
User supervision has been studied extensively in the context of distance function learning. One of the earliest methods that parameterizes the weights of the Lp-norm was proposed in [15]. The problem of distance function learning has been formally related to that of classification and has recently been studied in great detail. A survey that covers the important topics in distance function learning is provided in [33].
3.9 Exercises
1. Compute the Lp-norm between (1, 2) and (3, 4) for p = 1, 2, ∞.
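For a quick numerical check of this exercise, a few lines of NumPy suffice (a sketch, not part of the text):

```python
import numpy as np

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
diff = np.abs(x - y)                  # (2, 2)
print(np.sum(diff))                   # p = 1: 4.0
print(np.sqrt(np.sum(diff ** 2)))     # p = 2: 2*sqrt(2), about 2.828
print(np.max(diff))                   # p = infinity: 2.0
```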
2. Show that the Mahalanobis distance between two data points is equivalent to the Euclidean distance on a transformed data set, where the transformation is performed by representing the data along the principal components, and dividing by the standard deviation of each component.
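The claim can also be verified numerically. The sketch below uses synthetic correlated data and the eigendecomposition of the covariance matrix; all data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated data
mu, cov = D.mean(axis=0), np.cov(D, rowvar=False)

x, y = D[0], D[1]

# Mahalanobis distance via the inverse covariance matrix.
d_maha = np.sqrt((x - y) @ np.linalg.inv(cov) @ (x - y))

# Transform: project onto principal components, divide by the std. dev.
vals, vecs = np.linalg.eigh(cov)      # eigenvectors = principal components
T = (D - mu) @ vecs / np.sqrt(vals)   # whitened representation
d_eucl = np.linalg.norm(T[0] - T[1])  # plain Euclidean distance

print(d_maha, d_eucl)                 # the two values agree
```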
3. Download the Ionosphere data set from the UCI Machine Learning Repository [213], and compute the Lp distance between all pairs of data points, for p = 1, 2, and ∞. Compute the contrast measure on the data set for the different norms. Repeat the exercise using only the first r dimensions, where r varies from 1 to the full dimensionality of the data.
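A possible starting point for this exercise is sketched below. The file path is a placeholder for the downloaded data, and the contrast measure is written here as the spread of pairwise distances relative to their mean, an assumed form that should be replaced by the exact definition given in the chapter.

```python
import numpy as np
from itertools import combinations

# Placeholder path: download ionosphere.data from the UCI repository first.
# The last column holds the class label and is dropped.
data = np.genfromtxt("ionosphere.data", delimiter=",")[:, :-1]

def contrast(points, p):
    # All pairwise Lp distances, then an assumed contrast measure:
    # (max - min) relative to the mean distance.
    d = np.array([np.linalg.norm(a - b, ord=p)
                  for a, b in combinations(points, 2)])
    return (d.max() - d.min()) / d.mean()

for p in (1, 2, np.inf):
    print(p, contrast(data, p))
    # For the second part, repeat with data[:, :r] for r = 1, ..., data.shape[1].
```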
4. Compute the match-based similarity, the cosine similarity, and the Jaccard coefficient between the two sets {A, B, C} and {A, C, D, E}.
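The arithmetic can be checked as below; the cosine treats each set as a binary vector, and the match-based similarity is taken here as the raw count of common elements, an assumed reading that should be checked against the chapter's definition.

```python
S1, S2 = {"A", "B", "C"}, {"A", "C", "D", "E"}

inter, union = S1 & S2, S1 | S2       # {A, C} and {A, B, C, D, E}

match = len(inter)                                 # assumed: 2 common elements
jaccard = len(inter) / len(union)                  # 2/5 = 0.4
cosine = len(inter) / (len(S1) * len(S2)) ** 0.5   # 2/sqrt(12), about 0.577

print(match, jaccard, cosine)
```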
5. Let X and Y be two data points. Show that the cosine angle between the vectors X and Y is given by:
$$\cos(X, Y) = \frac{\|X\|^2 + \|Y\|^2 - \|X - Y\|^2}{2 \, \|X\| \, \|Y\|}$$
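A short expansion of the squared Euclidean distance gives the identity (a sketch of the derivation):

```latex
\|X - Y\|^2 = \|X\|^2 + \|Y\|^2 - 2\, X \cdot Y
\;\;\Longrightarrow\;\;
\cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}
           = \frac{\|X\|^2 + \|Y\|^2 - \|X - Y\|^2}{2 \, \|X\| \, \|Y\|}.
```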