Data Mining: The Textbook




[Table fragment] Kernel PCA: centered kernel matrix $S = \left(I - \frac{U}{n}\right) K \left(I - \frac{U}{n}\right)$ (cf. Sect. 10.6.4.1 of Chap. 10).

The objective function O of Eq. 2.16 is usually divided by $\sum_{i,j:\, i<j} \delta_{ij}^2$ to yield a value in (0, 1).




The square root of this value is referred to as Kruskal stress.
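As a concrete illustration, the sketch below computes Kruskal stress with NumPy. It assumes that Eq. 2.16 (which appears on the preceding page) is the sum over pairs i < j of squared differences between the embedded distances and the target distances; the function name kruskal_stress is ours.

```python
import numpy as np

def kruskal_stress(delta, X):
    """Kruskal stress of an embedding X against a target distance matrix delta.

    delta : (n, n) array of target pairwise distances delta_ij
    X     : (n, k) array of embedded coordinates
    """
    # Pairwise Euclidean distances between the embedded points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    iu = np.triu_indices(delta.shape[0], k=1)      # pairs with i < j
    # Normalized objective of Eq. 2.16, followed by the square root
    return np.sqrt(np.sum((d[iu] - delta[iu]) ** 2) / np.sum(delta[iu] ** 2))
```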










The basic assumption in classical MDS is that the distance matrix $\Delta = [\delta_{ij}^2]_{n \times n}$ is generated by computing pairwise Euclidean distances in some hypothetical data matrix D for which the entries and dimensionality are unknown. The matrix D can never be recovered completely in classical MDS because Euclidean distances are invariant to mean translation and axis rotation. The appropriate conventions for the data mean and axis orientation will be discussed later. While the optimization of Eq. 2.16 requires numerical techniques, a direct solution to classical MDS can be obtained by eigen decomposition under the assumption that the specified distance matrix is Euclidean:



  1. Any pairwise (squared) distance matrix $\Delta = [\delta_{ij}^2]_{n \times n}$ can be converted into a symmetric dot-product matrix $S_{n \times n}$ with the help of the cosine law in Euclidean space. In particular, if $X_i$ and $X_j$ are the embedded representations of the ith and jth nodes, the dot product between $X_i$ and $X_j$ can be related to the distances as follows:




$$X_i \cdot X_j = -\frac{1}{2}\left(||X_i - X_j||^2 - \left(||X_i||^2 + ||X_j||^2\right)\right) \quad \forall\, i, j \in \{1 \ldots n\} \qquad (2.17)$$




For a mean-centered embedding, the value of $||X_i||^2 + ||X_j||^2$ can be expressed (see Exercise 9) in terms of the entries of the distance matrix Δ as follows:

$$||X_i||^2 + ||X_j||^2 = \sum_{p=1}^{n} \frac{||X_i - X_p||^2}{n} + \sum_{q=1}^{n} \frac{||X_j - X_q||^2}{n} - \sum_{p=1}^{n}\sum_{q=1}^{n} \frac{||X_p - X_q||^2}{n^2} \qquad (2.18)$$




A mean-centering assumption is necessary because the Euclidean distance is invariant to mean translation, whereas the dot product is not. By substituting Eq. 2.18 into Eq. 2.17, it is possible to express the dot product $X_i \cdot X_j$ fully in terms of the entries of the distance matrix Δ. Because this condition is true for all possible values of i and j, we can conveniently express it in n × n matrix form. Let U be the n × n matrix of all 1s, and let I be the identity matrix. Then, the argument above shows that the dot-product matrix S is equal to $-\frac{1}{2}\left(I - \frac{U}{n}\right)\Delta\left(I - \frac{U}{n}\right)$. Under the Euclidean assumption, the matrix S is always positive semidefinite because it equals the n × n dot-product matrix $D D^T$ of the unobserved data matrix D, which has unknown dimensionality. Therefore, it is desired to determine a high-quality factorization of S into the form $D_k D_k^T$, where $D_k$ is an n × k matrix of dimensionality k.







  2. Such a factorization can be obtained with eigen decomposition. Let $S \approx Q_k \Sigma_k^2 Q_k^T = (Q_k \Sigma_k)(Q_k \Sigma_k)^T$ represent the approximate diagonalization of S, where $Q_k$ is an n × k matrix containing the largest k eigenvectors of S, and $\Sigma_k^2$ is a k × k diagonal matrix containing the corresponding eigenvalues. The embedded representation is given by $D_k = Q_k \Sigma_k$. Note that SVD also derives the optimal embedding as the scaled eigenvectors of the dot-product matrix of the original data. Therefore, the squared error of representation is minimized by this approach. This can also be shown to be equivalent to minimizing the Kruskal stress. Both steps are illustrated in the sketch below.
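A minimal NumPy sketch of this two-step procedure follows; the function name classical_mds and the use of NumPy are our own choices rather than anything prescribed by the text.

```python
import numpy as np

def classical_mds(delta_sq, k):
    """Classical MDS from a matrix of squared pairwise distances.

    delta_sq : (n, n) array with entries delta_ij^2
    k        : target dimensionality of the embedding
    Returns the (n, k) embedding D_k = Q_k Sigma_k.
    """
    n = delta_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # I - U/n
    S = -0.5 * J @ delta_sq @ J                # double-centered dot-product matrix
    w, V = np.linalg.eigh(S)                   # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]            # indices of the largest k eigenvalues
    lam = np.clip(w[order], 0.0, None)         # guard against tiny negative values
    return V[:, order] * np.sqrt(lam)          # Q_k Sigma_k
```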

The optimal solution is not unique, because we can multiply $Q_k \Sigma_k$ with any k × k matrix with orthonormal columns, and the pairwise Euclidean distances will not be affected. In other words, any representation of $Q_k \Sigma_k$ in a rotated axis system is optimal as well. Like PCA, MDS finds an axis system in which the individual attributes are uncorrelated. In fact, if classical MDS is applied to a distance matrix Δ that is constructed by computing the pairwise Euclidean distances in an actual data set, then it will yield the same embedding as the application of PCA on that data set. MDS is useful when such a data set is not available to begin with, and only the distance matrix Δ is available.
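To illustrate the equivalence with PCA claimed above, the following self-contained sketch (synthetic data and variable names are our own) builds a Euclidean distance matrix from a mean-centered random data set, embeds it with classical MDS, and checks that the pairwise distances of the 2-dimensional MDS embedding match those of the 2-dimensional PCA projection:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 5))                 # hypothetical data matrix
D -= D.mean(axis=0)                           # mean-center the data

# Squared pairwise Euclidean distances computed from D
sq = (D ** 2).sum(axis=1)
delta_sq = sq[:, None] + sq[None, :] - 2 * D @ D.T

# Classical MDS: double-center, then take the two largest scaled eigenvectors
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
S = -0.5 * J @ delta_sq @ J
w, V = np.linalg.eigh(S)
X_mds = V[:, -2:] * np.sqrt(w[-2:])           # two largest eigenpairs

# PCA scores via SVD of the centered data matrix
U, s, Vt = np.linalg.svd(D, full_matrices=False)
X_pca = U[:, :2] * s[:2]

# The embeddings agree only up to rotation and sign, so compare pairwise distances
def pdist(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

print(np.allclose(pdist(X_mds), pdist(X_pca)))  # expected: True
```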


As in all dimensionality reduction methods, the value of the dimensionality k provides the trade-off between representation size and accuracy. Larger values of the dimensionality k will lead to lower stress. A larger number of data points typically requires a larger dimensionality of representation to achieve the same stress. The most crucial element, however, is the inherent structure of the distance matrix. For example, if a 10,000 × 10,000 distance matrix contains the pairwise driving distances between 10,000 cities, it can usually be approximated quite well with just a 2-dimensional representation, because driving distances are an approximation of Euclidean distances in 2-dimensional space. On the other hand, an arbitrary distance matrix may not be Euclidean, and the distances may not even satisfy the triangle inequality. As a result, the matrix S might not be positive semidefinite. In such cases, it is sometimes still possible to use the metric assumption to obtain a high-quality embedding. Specifically, only those positive eigenvalues whose magnitude exceeds that of the most negative eigenvalue may be used. This approach will work reasonably well if the negative eigenvalues have small magnitude.
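The eigenvalue-filtering heuristic described above can be sketched as follows. The function name mds_indefinite is ours; the cutoff rule (keep only positive eigenvalues whose magnitude exceeds that of the most negative eigenvalue) is taken from the text.

```python
import numpy as np

def mds_indefinite(delta_sq, k):
    """MDS embedding for a possibly non-Euclidean squared-distance matrix.

    Retains only positive eigenvalues whose magnitude exceeds that of the
    most negative eigenvalue, up to at most k of them.
    """
    n = delta_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    S = -0.5 * J @ delta_sq @ J                  # may be indefinite here
    w, V = np.linalg.eigh(S)
    cutoff = max(0.0, -w.min())                  # magnitude of the most negative eigenvalue
    keep = np.where(w > cutoff)[0]               # sufficiently large positive eigenvalues
    keep = keep[np.argsort(w[keep])[::-1]][:k]   # largest first, at most k of them
    return V[:, keep] * np.sqrt(w[keep])
```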




