This distance function is the same as the Euclidean metric when A is the identity matrix. Different choices of A can lead to better sensitivity of the distance function to the local and global data distributions. These different choices will be discussed in the following subsections.
10.8.1.1 Unsupervised Mahalanobis Metric
The Mahalanobis metric is introduced in Chap. 3. In this case, the value of A is chosen to be the inverse of the d × d covariance matrix Σ of the data set. The (i, j)th entry of the matrix Σ is the covariance between the dimensions i and j. Therefore, the Mahalanobis distance is defined as follows:
Dist(\overline{X}, \overline{Y}) = \sqrt{ (\overline{X} - \overline{Y}) \, \Sigma^{-1} \, (\overline{X} - \overline{Y})^T }    (10.72)
The Mahalanobis metric adjusts well to the different scaling of the dimensions and the redundancies across different features. Even when the data is uncorrelated, the Mahalanobis metric is useful because it auto-scales for the naturally different ranges of attributes describing different physical quantities, such as age and salary. Such a scaling ensures that no single attribute dominates the distance function. In cases where the attributes are correlated, the Mahalanobis metric accounts well for the varying redundancies in different features. However, its major weakness is that it does not account for the varying shapes of the class distributions in the underlying data.
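To make this concrete, the following Python sketch evaluates Eq. 10.72 on a small synthetic data set: it estimates the covariance matrix Σ from the data, inverts it, and compares the resulting Mahalanobis distance with the plain Euclidean distance between the same pair of points. The data, variable names, and use of NumPy are illustrative assumptions, not part of the text.

import numpy as np

def mahalanobis_distance(x, y, data):
    # Covariance matrix Sigma estimated from the rows of `data` (Eq. 10.72).
    sigma = np.cov(data, rowvar=False)
    sigma_inv = np.linalg.inv(sigma)   # use np.linalg.pinv if Sigma is singular
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ sigma_inv @ diff))

# Illustrative data: two correlated attributes on very different scales
# (e.g., age in years and salary in dollars).
rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=200)
salary = 1000 * age + rng.normal(0, 5000, size=200)
data = np.column_stack([age, salary])

x, y = data[0], data[1]
print("Euclidean:  ", np.linalg.norm(x - y))              # dominated by the salary axis
print("Mahalanobis:", mahalanobis_distance(x, y, data))   # auto-scaled and de-correlated

Because Σ−1 divides out the per-attribute variances and the cross-attribute correlation, neither the salary scale nor the age–salary redundancy dominates the second distance.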
10.8.1.2 Nearest Neighbors with Linear Discriminant Analysis
To obtain the best results with a nearest-neighbor classifier, the distance function needs to account for the varying distribution of the different classes. For example, in the case of Fig. 10.11, there are two classes A and B, which are represented by “.” and “*,” respectively. The test instance denoted by X lies on the side of the boundary related to class A. However, the Euclidean metric does not adjust well to the arrangement of the class distribution, and a circle drawn around the test instance seems to include more points from class B than class A.
One way of resolving the challenges associated with this scenario is to weight the most discriminating directions more heavily in the distance function with an appropriate choice of the matrix A.
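The excerpt breaks off before specifying how the matrix A is constructed. As a minimal sketch of the idea, the following Python code uses one common LDA-flavored choice, A = S_w^{-1}, the inverse of the within-class scatter matrix, so that directions along which the classes are tightly packed are weighted more heavily in a nearest-neighbor classifier. This particular choice of A, together with all names and the synthetic data, is an assumption for illustration rather than necessarily the construction the text goes on to describe.

import numpy as np

def within_class_scatter(X, labels):
    # S_w: sum over classes of the scatter of each class around its own mean.
    d = X.shape[1]
    S_w = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        diff = Xc - Xc.mean(axis=0)
        S_w += diff.T @ diff
    return S_w

def weighted_distance(x, y, A):
    # Distance of the form sqrt((x - y) A (x - y)^T) for a given matrix A.
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))

def nearest_neighbor_predict(test_point, X, labels, A):
    # 1-nearest-neighbor prediction under the weighted distance.
    dists = [weighted_distance(test_point, x, A) for x in X]
    return labels[int(np.argmin(dists))]

# Two synthetic classes that are elongated along the first axis; A = S_w^{-1}
# down-weights that direction, so the discriminating second axis matters more.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], [3.0, 0.3], (50, 2)),
               rng.normal([0.0, 1.0], [3.0, 0.3], (50, 2))])
labels = np.array(["A"] * 50 + ["B"] * 50)
A = np.linalg.inv(within_class_scatter(X, labels))
print(nearest_neighbor_predict(np.array([1.0, 0.2]), X, labels, A))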
This approach is also referred to as leave-one-out cross-validation, and is described in detail in Sect. 10.9 on classifier evaluation.