1. Partition the data into a set of local regions.
2. For any pair of objects, determine the most relevant region for the pair, and compute the pairwise distance using the local statistics of that region. For example, the local Mahalanobis distance may be used in each local region.
A variety of clustering methods are used for partitioning the data into local regions. In cases where each of the objects in the pair belongs to a different region, either the global distribution may be used, or the average may be computed using both local regions. Another problem is that the first step of the algorithm (partitioning process) itself requires a notion of distances for clustering. This makes the solution circular, and calls for an iterative solution. Although a detailed discussion of these methods is beyond the scope of this book, the bibliographic notes at the end of this chapter provide a number of pointers.
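As one illustration of this two-step scheme, the sketch below partitions the data with k-means and then computes a local Mahalanobis distance for a pair of points. The choice of k-means, the midpoint heuristic for selecting the "most relevant" region, and the function names are assumptions made for the example rather than a standard algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_local_regions(data, n_regions=5, random_state=0):
    """Partition the data into local regions and estimate per-region statistics."""
    km = KMeans(n_clusters=n_regions, n_init=10, random_state=random_state).fit(data)
    inv_covs = []
    for r in range(n_regions):
        points = data[km.labels_ == r]
        # Assumes each region contains enough points to estimate a covariance matrix;
        # a small ridge keeps the matrix invertible in sparsely populated regions.
        cov = np.cov(points, rowvar=False) + 1e-6 * np.eye(data.shape[1])
        inv_covs.append(np.linalg.inv(cov))
    return km, inv_covs

def local_mahalanobis(x, y, km, inv_covs):
    """Distance between x and y under the statistics of the pair's 'most relevant' region,
    chosen here (heuristically) as the region whose centroid is closest to the midpoint."""
    region = km.predict(((x + y) / 2.0).reshape(1, -1))[0]
    diff = x - y
    return float(np.sqrt(diff @ inv_covs[region] @ diff))
```

On a data set stored as a NumPy array `data` of shape (n, d), one would call `km, inv_covs = fit_local_regions(data)` once, and then evaluate `local_mahalanobis(data[i], data[j], km, inv_covs)` for any pair of interest.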
3.2.1.9 Computational Considerations
A major consideration in the design of distance functions is the computational complexity. This is because distance function computation is often embedded as a subroutine that is used repeatedly in the application at hand. If the subroutine cannot be implemented efficiently, its applicability becomes more restricted. For example, methods such as ISOMAP are computationally expensive and hard to implement for very large data sets because these methods scale with at least the square of the data size. However, they do have the merit that a one-time transformation can create a representation that can be used efficiently by data mining algorithms. Distance functions are executed repeatedly, whereas the preprocessing is performed only once. Therefore, it is clearly advantageous to use a preprocessing-intensive approach as long as it speeds up later computations. For many applications, sophisticated methods such as ISOMAP may be too expensive even for one-time analysis. For such cases, one of the earlier methods discussed in this chapter may need to be used. Among the methods discussed in this section, carefully chosen Lp-norms and match-based techniques are the fastest methods for large-scale applications.
3.2.2 Categorical Data
Distance functions are naturally computed as functions of value differences along dimensions in numeric data, which is ordered. However, no ordering exists among the discrete values of categorical data. How can distances be computed? One possibility is to transform the categorical data to numeric data with the use of the binarization approach discussed in Sect. 2.2.2.2 of Chap. 2. Because the binary vector is likely to be sparse (many zero values), similarity functions can be adapted from other sparse domains such as text. For the case of categorical data, it is more common to work with similarity functions rather than distance functions because discrete values can be matched more naturally.
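For instance, a categorical record can be binarized into a sparse indicator vector and compared with a text-style similarity function such as the cosine. The attribute domains, records, and function names below are illustrative and not taken from the text.

```python
import numpy as np

def binarize(record, domains):
    """One-hot encode a categorical record.

    `domains` lists the possible values of each attribute, e.g.
    [["Red", "Blue"], ["Low", "Medium", "High"]].
    """
    vec = []
    for value, domain in zip(record, domains):
        vec.extend(1.0 if value == v else 0.0 for v in domain)
    return np.array(vec)

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

domains = [["Red", "Blue", "Green"], ["Low", "Medium", "High"]]
x = binarize(("Red", "High"), domains)
y = binarize(("Red", "Low"), domains)
print(cosine_similarity(x, y))  # 0.5: the records match on one of their two attributes
```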
Consider two records X = (x1 . . . xd) and Y = (y1 . . . yd). The simplest possible similarity between the records X and Y is the sum of the similarities on the individual attribute values. In other words, if S(xi, yi) is the similarity between the attribute values xi and yi, then the overall similarity is defined as follows:
$$\text{Sim}(X, Y) = \sum_{i=1}^{d} S(x_i, y_i)$$
Therefore, the choice of S(xi, yi) defines the overall similarity function.
The simplest possible choice is to set S(xi, yi) to 1 when xi = yi and 0 otherwise. This is also referred to as the overlap measure. The major drawback of this measure is that it does not account for the relative frequencies of the different attribute values. For example, consider a categorical attribute in which the attribute value is "Normal" for 99% of the records, and either "Cancer" or "Diabetes" for the remaining records. Clearly, if two records have a "Normal" value for this attribute, this does not provide statistically significant evidence of similarity, because the majority of pairs are likely to show that pattern just by chance. However, if the two records have a matching "Cancer" or "Diabetes" value, this provides significant statistical evidence of similarity. This argument is similar to that made earlier about the importance of the global data distribution. Similarities or differences that are unusual are statistically more significant than those that are common.
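A direct rendering of the overlap measure, assuming that records are represented as tuples of categorical values:

```python
def overlap_similarity(x, y):
    """Overlap measure: count the attributes on which the two records match."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi)

print(overlap_similarity(("Normal", "A"), ("Normal", "B")))  # 1
```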
In the context of categorical data, the aggregate statistical properties of the data set should be used in computing similarity, just as the Mahalanobis distance used global statistics to compute distances more accurately. The idea is that matches on unusual values of a categorical attribute should be weighted more heavily than matches on frequently occurring values. This also forms the underlying principle of many common normalization techniques used in domains such as text. An example, which is discussed in the next section, is the use of inverse document frequency (IDF) in the information retrieval domain. An analogous measure for categorical data is introduced here.
The inverse occurrence frequency is a generalization of the simple matching measure. It weights a match between two records on the kth attribute by an inverse function of the frequency of the matched value. Let pk(x) be the fraction of records in which the kth attribute takes on the value x in the data set. Then, the similarity S(xi, yi) is defined as follows:

$$S(x_i, y_i) = \begin{cases} 1/p_k(x_i)^2 & \text{if } x_i = y_i \\ 0 & \text{otherwise.} \end{cases}$$
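The following sketch estimates the fractions pk(·) from a small hypothetical data set and uses them in the inverse occurrence frequency measure; the data values and function names are illustrative.

```python
from collections import Counter

def value_fractions(data):
    """pk(x): fraction of records in which attribute k takes the value x."""
    n = len(data)
    counts = [Counter(record[k] for record in data) for k in range(len(data[0]))]
    return [{value: c / n for value, c in counter.items()} for counter in counts]

def iof_similarity(x, y, fractions):
    """Inverse occurrence frequency: matches on rare values count more heavily."""
    return sum(1.0 / fractions[k][xi] ** 2
               for k, (xi, yi) in enumerate(zip(x, y)) if xi == yi)

data = [("Normal", "A"), ("Normal", "B"), ("Cancer", "A"), ("Normal", "A")]
fractions = value_fractions(data)
# A match on the rare value "Cancer" contributes 1/0.25**2 = 16,
# whereas a match on "Normal" contributes only 1/0.75**2, roughly 1.78.
print(iof_similarity(("Normal", "A"), ("Normal", "B"), fractions))
```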