Data Mining: The Textbook

Partition the data into a set of local regions

Yüklə 17,13 Mb.

səhifə	49/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 45 46 47 48 49 50 51 52 ... 423

1-Data Mining tarjima

Partition the data into a set of local regions.

For any pair of objects, determine the most relevant region for the pair, and compute the pairwise distances using the local statistics of that region. For example, the local Mahalanobis distance may be used in each local region.

A variety of clustering methods are used for partitioning the data into local regions. In cases where each of the objects in the pair belongs to a diﬀerent region, either the global distribution may be used, or the average may be computed using both local regions. Another problem is that the first step of the algorithm (partitioning process) itself requires a notion of distances for clustering. This makes the solution circular, and calls for an iterative solution. Although a detailed discussion of these methods is beyond the scope of this book, the bibliographic notes at the end of this chapter provide a number of pointers.

3.2.1.9 Computational Considerations

A major consideration in the design of distance functions is the computational complexity. This is because distance function computation is often embedded as a subroutine that is used repeatedly in the application at hand. If the subroutine is not eﬃciently implementable, the applicability becomes more restricted. For example, methods such as ISOMAP are computationally expensive and hard to implement for very large data sets because these methods scale with at least the square of the data size. However, they do have the merit that a one-time transformation can create a representation that can be used eﬃciently by data mining algorithms. Distance functions are executed repeatedly, whereas the preprocessing is performed only once. Therefore, it is definitely advantageous to use a preprocessing-intensive approach as long as it speeds up later computations. For many applications, sophisticated methods such as ISOMAP may be too expensive even for one-time analysis. For such cases, one of the earlier methods discussed in this chapter may need to be used. Among the methods discussed in this section, carefully chosen L_p-norms and match-based techniques are the fastest methods for large-scale applications.

74 CHAPTER 3. SIMILARITY AND DISTANCES

3.2.2 Categorical Data

Distance functions are naturally computed as functions of value diﬀerences along dimensions in numeric data, which is ordered. However, no ordering exists among the discrete values of categorical data. How can distances be computed? One possibility is to transform the categorical data to numeric data with the use of the binarization approach discussed in Sect. 2.2.2.2 of Chap. 2. Because the binary vector is likely to be sparse (many zero values), similarity functions can be adapted from other sparse domains such as text. For the case of categorical data, it is more common to work with similarity functions rather than distance functions because discrete values can be matched more naturally.

Consider two records X = (x₁ . . . x_d) and Y = (y₁ . . . y_d). The simplest possible similar-ity between the records X and Y is the sum of the similarities on the individual attribute values. In other words, if S(x_i, y_i) is the similarity between the attributes values x_i and y_i, then the overall similarity is defined as follows:

d
Sim(X, Y ) = S(x_i, y_i).

i=1

Therefore, the choice of S(x_i, y_i) defines the overall similarity function.

The simplest possible choice is to set S(x _i, y_i) to 1 when x _i = y_i and 0 otherwise. This is also referred to as the overlap measure. The major drawback of this measure is that it does not account for the relative frequencies among the diﬀerent attributes. For example, consider a categorical attribute in which the attribute value is “Normal” for 99 % of the records, and either “Cancer” or “Diabetes” for the remaining records. Clearly, if two records have a “Normal” value for this variable, this does not provide statistically significant information about the similarity, because the majority of pairs are likely to show that pattern just by chance. However, if the two records have a matching “Cancer” or “Diabetes” value for this variable, it provides significant statistical evidence of similarity. This argument is similar to that made earlier about the importance of the global data distribution. Similarities or diﬀerences that are unusual are statistically more significant than those that are common.

In the context of categorical data, the aggregate statistical properties of the data set should be used in computing similarity. This is similar to how the Mahalanobis distance was used to compute similarity more accurately with the use of global statistics. The idea is that matches on unusual values of a categorical attribute should be weighted more heavily than values that appear frequently. This also forms the underlying principle of many com-mon normalization techniques that are used in domains such as text. An example, which is discussed in the next section, is the use of inverse document frequency (IDF) in the information retrieval domain. An analogous measure for categorical data will be introduced here.

The inverse occurrence frequency is a generalization of the simple matching measure. This measure weights the similarity between the matching attributes of two records by an inverse function of the frequency of the matched value. Thus, when x_i = y_i, the similarity S(x_i, y_i) is equal to the inverse weighted frequency, and 0 otherwise. Let p_k(x) be the fraction of records in which the kth attribute takes on the value of x in the data set. In other words, when x_i = y_i, the value of S(x_i, y_i) is 1/p_k(x_i)² and 0 otherwise.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 45 46 47 48 49 50 51 52 ... 423