$$||\overline{X_i}||^2 + ||\overline{X_j}||^2 = \frac{\sum_{p=1}^{n} ||\overline{X_i} - \overline{X_p}||^2}{n} + \frac{\sum_{q=1}^{n} ||\overline{X_j} - \overline{X_q}||^2}{n} - \frac{\sum_{p=1}^{n} \sum_{q=1}^{n} ||\overline{X_p} - \overline{X_q}||^2}{n^2} \qquad (2.22)$$
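As a quick numerical check of this identity (which holds for mean-centered data points, as in the surrounding derivation), the following NumPy sketch compares both sides of Equation 2.22 on a randomly generated, centered data matrix; the sizes and the pair of indices are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # n = 100 points in d = 5 dimensions
X -= X.mean(axis=0)                 # the identity assumes mean-centered data

# Matrix of pairwise squared Euclidean distances.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

i, j = 3, 7                                     # arbitrary pair of points
lhs = (X[i] ** 2).sum() + (X[j] ** 2).sum()     # ||X_i||^2 + ||X_j||^2
rhs = sq[i].mean() + sq[j].mean() - sq.mean()   # right-hand side of (2.22)
print(np.isclose(lhs, rhs))                     # True
```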
Consider the time series 1, 1, 3, 3, 3, 3, 1, 1. Perform wavelet decomposition on the time series. How many coefficients of the series are nonzero?
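For experimentation, here is a minimal sketch of one common Haar decomposition convention (pairwise averages carry the signal to the next level, and half-differences are kept as detail coefficients); other texts use different normalizations, so coefficient values may differ by constant factors, and this sketch assumes the series length is a power of 2:

```python
import numpy as np

def haar_decompose(series):
    """Haar-style decomposition: averages propagate to the next level,
    half-differences are recorded as detail coefficients."""
    coeffs = []
    s = np.asarray(series, dtype=float)
    while len(s) > 1:
        coeffs.append((s[0::2] - s[1::2]) / 2.0)  # detail coefficients
        s = (s[0::2] + s[1::2]) / 2.0             # averages for next level
    coeffs.append(s)                              # overall series average
    return coeffs[::-1]                           # coarsest level first

# The series from the exercise; count the nonzero entries in the result.
print(haar_decompose([1, 1, 3, 3, 3, 3, 1, 1]))
```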
Download the Intel Research Berkeley data set. Apply a wavelet transformation to the temperature values in the first sensor.
Treat each quantitative variable in the KDD CUP 1999 Network Intrusion Data Set from the UCI Machine Learning Repository [213] as a time series. Perform the wavelet decomposition of this time series.
Create samples of size n = 1, 10, 100, 1000, 10000 records from the data set of the previous exercise, and determine the average value $e_i$ of each quantitative column $i$ using the sample. Let $\mu_i$ and $\sigma_i$ be the global mean and standard deviation over the entire data set. Compute the number of standard deviations $z_i$ by which $e_i$ varies from $\mu_i$:

$$z_i = \frac{|e_i - \mu_i|}{\sigma_i}$$

How does $z_i$ vary with $n$?
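A possible skeleton for this exercise, shown on a synthetic stand-in matrix since downloading the real data set is left to the reader; the distribution and all names are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the quantitative columns of the real data set.
data = rng.exponential(size=(100000, 5))

mu = data.mean(axis=0)      # global means mu_i
sigma = data.std(axis=0)    # global standard deviations sigma_i

for n in [1, 10, 100, 1000, 10000]:
    idx = rng.choice(len(data), size=n, replace=False)
    e = data[idx].mean(axis=0)      # sample means e_i
    z = np.abs(e - mu) / sigma      # deviations in standard-deviation units
    print(n, z)  # z_i typically shrinks roughly as 1/sqrt(n)
```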
Show that any right singular vector y of A with singular value 0 satisfies Ay = 0.
Show that the diagonalization of a square matrix is a specialized variation of SVD.
Chapter 3
Similarity and Distances
“Love is the power to see similarity in the dissimilar.”—Theodor Adorno
3.1 Introduction
Many data mining applications require the determination of similar or dissimilar objects, patterns, attributes, and events in the data. In other words, a methodical way of quantifying similarity between data objects is required. Virtually all data mining problems, such as clustering, outlier detection, and classification, require the computation of similarity. A formal statement of the problem of similarity or distance quantification is as follows:
Given two objects O1 and O2, determine a value of the similarity Sim(O1, O2) (or distance Dist(O1, O2)) between the two objects.
In similarity functions, larger values imply greater similarity, whereas in distance functions, smaller values imply greater similarity. In some domains, such as spatial data, it is more natural to talk about distance functions, whereas in other domains, such as text, it is more natural to talk about similarity functions. Nevertheless, the principles involved in the design of such functions are generally invariant across different data domains. This chapter will, therefore, use either of the terms “distance function” and “similarity function,” depending on the domain at hand. Similarity and distance functions are often expressed in closed form (e.g., Euclidean distance), but in some domains, such as time-series data, they are defined algorithmically and cannot be expressed in closed form.
Distance functions are fundamental to the effective design of data mining algorithms, because a poor choice in this respect may be very detrimental to the quality of the results. Sometimes, data analysts use the Euclidean function as a “black box” without much thought about the overall impact of such a choice. It is not uncommon for an inexperienced analyst to invest significant effort in the algorithmic design of a data mining problem, while treating the distance function subroutine as an afterthought. This is a mistake. As this chapter will elucidate, poor choices of the distance function can sometimes be disastrously misleading, depending on the application domain. Good distance function design is also crucial for type portability. As discussed in Sect. 2.4.4.3 of Chap. 2, spectral embedding can be used to convert a similarity graph constructed on any data type into multidimensional data.
Distance functions are highly sensitive to the data distribution, dimensionality, and data type. In some data types, such as multidimensional data, it is much simpler to define and compute distance functions than in other types such as time-series data. In some cases, user intentions (or training feedback on object pairs) are available to supervise the distance function design. Although this chapter will primarily focus on unsupervised methods, we will also briefly touch on the broader principles of using supervised methods.
This chapter is organized as follows. Section 3.2 studies distance functions for multidimensional data. This includes quantitative, categorical, and mixed attribute data. Similarity measures for text, binary, and set data are discussed in Sect. 3.3. Temporal data is discussed in Sect. 3.4. Distance functions for graph data are addressed in Sect. 3.5. A discussion of supervised similarity will be provided in Sect. 3.6. Section 3.7 gives a summary.
3.2 Multidimensional Data
Although multidimensional data are the simplest form of data, there is significant diversity in distance function design across different attribute types such as categorical or quantitative data. This section will therefore study each of these types separately.
3.2.1 Quantitative Data
The most common distance function for quantitative data is the $L_p$-norm. The $L_p$-norm between two data points $\overline{X} = (x_1 \ldots x_d)$ and $\overline{Y} = (y_1 \ldots y_d)$ is defined as follows:

$$Dist(\overline{X}, \overline{Y}) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p} \qquad (3.1)$$
Two special cases of the $L_p$-norm are the Euclidean (p = 2) and the Manhattan (p = 1) metrics. These special cases derive their intuition from spatial applications where they have clear physical interpretability. The Euclidean distance is the straight-line distance between two data points. The Manhattan distance is the “city block” driving distance in a region in which the streets are arranged as a rectangular grid, such as the Manhattan Island of New York City.
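A minimal NumPy sketch of Equation 3.1, with the function name and test points chosen purely for illustration:

```python
import numpy as np

def lp_distance(x, y, p):
    """The L_p-norm distance of Equation 3.1 (name is illustrative)."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** p).sum() ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
print(lp_distance(x, y, 1))  # Manhattan: 7.0
print(lp_distance(x, y, 2))  # Euclidean: 5.0
```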
A nice property of the Euclidean distance is that it is rotation-invariant, because the straight-line distance between two data points does not change with the orientation of the axis system. This property also means that transformations, such as PCA, SVD, or the wavelet transformation for time series (discussed in Chap. 2), can be used on the data without affecting¹ the distance. Another interesting special case is that obtained by setting p = ∞. The result of this computation is to select the dimension for which the two objects are the most distant from one another and report the absolute value of this distance. All other features are ignored.
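The following sketch, with illustrative values, demonstrates both observations: the $L_\infty$ distance reduces to the largest per-dimension gap, and an orthogonal transformation (obtained here from a QR decomposition of a random matrix) preserves the Euclidean distance but generally not the Manhattan distance:

```python
import numpy as np

x = np.array([1.0, 5.0, 2.0])
y = np.array([4.0, 6.0, 7.0])

# p = infinity: only the single most distant dimension matters.
print(np.max(np.abs(x - y)))                 # 5.0, from the third dimension

# Orthogonal transformation (rotation/reflection) from a QR decomposition.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

print(np.isclose(np.linalg.norm(x - y),
                 np.linalg.norm(Q @ x - Q @ y)))  # True: L2 is preserved
print(np.abs(x - y).sum(),
      np.abs(Q @ x - Q @ y).sum())                # generally differ: L1 is not
```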
The $L_p$-norm is one of the most popular distance functions used by data mining analysts. One of the reasons for its popularity is the natural intuitive appeal and interpretability of the $L_1$- and $L_2$-norms in spatial applications. The intuitive interpretability of these distances does not, however, mean that they are the most relevant ones, especially for the high-dimensional case. In fact, these distance functions may not work very well when the data
¹The distances are affected after dimensions are dropped. However, the transformation itself does not impact distances.