Data Mining: The Textbook




MDS is commonly used in nonlinear dimensionality reduction methods such as ISOMAP (cf. Sect. 3.2.1.7 of Chap. 3). It is noteworthy that, in conventional SVD, the scaled eigenvectors of the n × n dot-product similarity matrix DD^T yield a low-dimensional embedded representation of D, just as the eigenvectors of S yield the embedding in MDS. The eigendecomposition of similarity matrices is fundamental to many linear and nonlinear dimensionality reduction methods such as PCA, SVD, ISOMAP, kernel PCA, and spectral embedding. The specific properties of each embedding are a result of the choice of the similarity matrix and the scaling applied to the resulting eigenvectors. Table 2.3 provides a preliminary comparison of these methods, although some of them are discussed in detail only in later chapters.
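The eigendecomposition view of MDS described above can be sketched in a few lines of numpy. This is a minimal illustration, not the book's code: the random data, the choice of k = 2, and the variable names are all assumptions for the example. The squared-distance matrix is double-centered to obtain the dot-product similarity matrix S, whose scaled top eigenvectors give the embedding.

```python
import numpy as np

# Minimal sketch of classical MDS via eigendecomposition of the
# double-centered similarity matrix S (data and k are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # hidden 5-dimensional data
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances

n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
S = -0.5 * J @ D2 @ J                 # dot-product similarity matrix

evals, evecs = np.linalg.eigh(S)      # eigenvalues in ascending order
order = np.argsort(evals)[::-1]       # select the k largest
k = 2
top_vals = evals[order[:k]]
top_vecs = evecs[:, order[:k]]

# Scale eigenvectors by the square roots of the eigenvalues to obtain
# the embedded representation, as described in the text.
Y = top_vecs * np.sqrt(np.maximum(top_vals, 0.0))
print(Y.shape)                        # one k-dimensional row per object
```

Note that S is positive semidefinite up to numerical error, which is why the eigenvalues are clipped at zero before taking square roots.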
2.4.4.3 Spectral Transformation and Embedding of Graphs

Whereas MDS methods are designed to preserve global distances, spectral methods are designed to preserve local distances for applications such as clustering. Spectral methods work with similarity graphs in which the weights on the edges represent similarities rather than distances. When distance values are available, they are converted to similarity values with kernel functions such as the heat kernel discussed earlier in this chapter. The notion of similarity is natural to many real Web, social, and information networks because of the notion of homophily. For example, consider a bibliographic network in which nodes correspond to authors, and edges correspond to co-authorship relations. The weight of an edge represents the number of joint publications between the two authors and therefore captures one possible notion of similarity in author publications. Similarity graphs can also be constructed between arbitrary data types. For example, a set of n time series can be converted into a graph with n nodes, where each node represents one time series. The weight of an edge is equal to the similarity between the two nodes, and only edges with a "sufficient" level of similarity are retained. A discussion of the construction of the similarity graph is provided in Sect. 2.2.2.9. Therefore, if a similarity graph can be transformed into a multidimensional representation that preserves the similarity structure between nodes, it provides a transformation that can port virtually any data type to the easily usable multidimensional representation. The caveat is that such a transformation can only be used for similarity-based applications such as clustering or nearest-neighbor classification, because it is designed to preserve the local similarity structure. The local similarity structure of a data set is nevertheless fundamental to many data mining applications.


Let G = (N, A) be an undirected graph with node set N and edge set A. It is assumed that the node set contains n nodes. A symmetric n × n weight matrix W = [wij] represents the similarities between the different nodes. Unlike MDS, which works with a complete graph of global distances, this graph is generally a sparsified representation of the similarity of each object to its k nearest objects (cf. Sect. 2.2.2.9). The similarities to the remaining objects are not distinguished from one another and are set to 0. This is because spectral methods preserve only the local similarity structure for applications such as clustering. All entries in this matrix are assumed to be nonnegative, and higher values indicate greater similarity. If an edge does not exist between a pair of nodes, then the corresponding entry is assumed to be 0. It is desired to embed the nodes of this graph into a k-dimensional space so that the similarity structure of the data is preserved.
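The construction of such a sparsified weight matrix W can be sketched as follows. This is an illustrative example under assumed choices: random data, k = 3 neighbors, and a heat-kernel bandwidth t = 1.0; none of these values come from the text.

```python
import numpy as np

# Sketch: build a sparsified similarity graph in which each object
# retains only its k nearest neighbors, with heat-kernel weights
# w_ij = exp(-d_ij^2 / t). Data, k, and t are illustrative choices.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

k, t = 3, 1.0
W = np.zeros_like(dist)
for i in range(len(X)):
    # indices of the k nearest neighbors of i (position 0 is i itself)
    nn = np.argsort(dist[i])[1:k + 1]
    W[i, nn] = np.exp(-dist[i, nn] ** 2 / t)

# Symmetrize so that the resulting graph is undirected, with all
# remaining (non-neighbor) similarities left at 0.
W = np.maximum(W, W.T)
print(int((W > 0).sum()))
```

Symmetrizing with an elementwise maximum is one common convention; keeping an edge only when both endpoints list each other as neighbors is another.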


First, let us discuss the much simpler problem of mapping the nodes onto a 1-dimensional space; the generalization to the k-dimensional case is relatively straightforward. We would like to map the nodes in N to a set of 1-dimensional real values y1 . . . yn on a line, so that the distances between these points reflect the edge connectivity among the nodes. In particular, it is undesirable for nodes that are connected with high-weight edges to be mapped onto distant points on this line. Therefore, we would like to determine values of yi that minimize the following objective function O:
O = Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij · (y_i − y_j)^2
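Minimizing an objective of this form under a scale constraint on the y_i reduces to an eigenvector problem on the graph Laplacian L = D − W, where D is the diagonal matrix of node degrees; the eigenvector associated with the second-smallest eigenvalue (the smallest belongs to the trivial constant vector) gives the 1-dimensional embedding. The sketch below illustrates this on a small assumed graph: two densely connected triangles joined by a single weak edge.

```python
import numpy as np

# Sketch: 1-dimensional spectral embedding of a small illustrative
# graph. Minimizing sum_ij w_ij (y_i - y_j)^2 under a scale constraint
# yields an eigenvector of the unnormalized graph Laplacian L = D - W.
W = np.array([[0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],  # weak bridge edge
              [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]])

L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
evals, evecs = np.linalg.eigh(L)      # eigenvalues in ascending order
y = evecs[:, 1]                       # skip the trivial constant eigenvector

# Nodes in the same densely connected triangle receive nearby
# coordinates on the line; the two triangles land on opposite sides.
print(np.round(y, 3))
```

The nonzero entries of y split by sign exactly along the two triangles, which is the local-similarity-preserving behavior the objective is designed to enforce.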