Data Mining: The Textbook




a natural route for using SVMs in complex data types. This is because kernel methods only need the pairwise similarity between objects, and are agnostic to the feature values of the data points. Kernel functions have been defined for text, images, sequences, and graphs.


10.6.4.1 Other Applications of Kernel Methods


The use of kernel methods is not restricted to SVMs. These methods can be extended to any technique in which the solutions are directly or indirectly expressed in terms of dot products. Examples include Fisher's discriminant, logistic regression, linear regression (cf. Sect. 11.5.4 of Chap. 11), dimensionality reduction, and k-means clustering.





  1. Kernel k-means: The key idea is that the Euclidean distance between a data point X and the cluster centroid μ of cluster C can be expressed as a function of dot products involving X and the data points in C:

\[ \|X - \mu\|^2 = \Big\| X - \frac{\sum_{X_i \in C} X_i}{|C|} \Big\|^2 = X \cdot X \;-\; \frac{2 \sum_{X_i \in C} X \cdot X_i}{|C|} \;+\; \frac{\sum_{X_i \in C} \sum_{X_j \in C} X_i \cdot X_j}{|C|^2} \qquad (10.66) \]
In kernel k-means, the dot products Xi · Xj are replaced with kernel similarity values K(Xi, Xj). For the data point X, the index of its assigned cluster is obtained by selecting the minimum value of the (kernel-based) distance in Eq. 10.66 over all clusters. Note that the cluster centroids in the transformed space do not need to be explicitly maintained over the different k-means iterations, although the cluster assignment indices for each data point need to be maintained for computation of Eq. 10.66. Because of its implicit nonlinear transformation approach, kernel k-means is able to discover arbitrarily shaped clusters like spectral clustering, in spite of its use of the spherically biased Euclidean distance. A code sketch of this kernel-based assignment step is provided after this list.





  2. Kernel PCA: In conventional SVD and PCA of an n × d mean-centered data matrix D, the basis vectors are given by the eigenvectors of D^T D (the columnwise dot product matrix), and the coordinates of the transformed points are extracted from the scaled eigenvectors of DD^T (the rowwise dot product matrix). While the basis vectors can no longer be derived in kernel PCA, the coordinates of the transformed data can be extracted. The rowwise dot product matrix DD^T can be replaced with the kernel similarity matrix S = [K(Xi, Xj)]_{n×n}. The similarity matrix is then adjusted for mean-centering of the data in the transformed space as S ⇐ (I − U/n) S (I − U/n), where U is an n × n matrix containing all 1s (see Exercise 17). The assumption is that the matrix S can be approximately expressed in terms of dot products of the reduced data points in some k-dimensional transformed space. Therefore, one needs to approximately factorize S into the form AA^T to extract its reduced n × k embedding A in the transformed space. This is achieved by eigendecomposition. Let Q_k be the n × k matrix containing the largest k eigenvectors of S, and Σ_k be the k × k diagonal matrix containing the square roots of the corresponding eigenvalues. Then, it is evident that S ≈ Q_k Σ_k^2 Q_k^T = (Q_k Σ_k)(Q_k Σ_k)^T, and the k-dimensional embeddings of the data points are given by the rows of the n × k matrix A = Q_k Σ_k. Note that this is a truncated version of the data-specific Mercer kernel map. This nonlinear embedding is similar to that obtained




  • The original result [450] uses a more general argument to derive S′ Q_k Σ_k^{-1} as the m × k matrix of k-dimensional embedded coordinates of any out-of-sample m × d matrix D′. Here, S′ = D′D^T is the m × n matrix of kernel similarities between out-of-sample points in D′ and in-sample points in D. However, when D′ = D, this expression is (more simply) equivalent to Q_k Σ_k by expanding S′ = S ≈ Q_k Σ_k^2 Q_k^T.



by ISOMAP; however, unlike ISOMAP, out-of-sample points can also be transformed to the new space. It is noteworthy that the embedding of spectral clustering is also expressed in terms of the large eigenvectors of a sparsified similarity matrix, which is better suited to preserving local similarities for clustering. In fact, most forms of nonlinear embeddings can be shown to be large eigenvectors of similarity matrices (cf. Table 2.3 of Chap. 2), and are therefore special cases of kernel PCA. A code sketch of the kernel PCA embedding computation is also provided after this list.
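As referenced in the kernel k-means discussion above, the following Python sketch illustrates the assignment step of Eq. 10.66 with the dot products replaced by kernel values. The RBF kernel, the synthetic two-ring data set, and the fixed number of iterations are illustrative assumptions rather than part of the original text.

import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2), an illustrative kernel choice
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_kmeans_assign(K, labels, k):
    # K: n x n kernel similarity matrix; labels: current cluster indices; k: number of clusters
    n = K.shape[0]
    dist = np.full((n, k), np.inf)
    for c in range(k):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        # Eq. 10.66 with dot products replaced by kernel values:
        # ||X - mu_c||^2 = K(X, X) - 2 * sum_i K(X, Xi)/|C| + sum_{i,j} K(Xi, Xj)/|C|^2
        self_term = np.diag(K)
        cross_term = 2.0 * K[:, members].sum(axis=1) / members.size
        cluster_term = K[np.ix_(members, members)].sum() / members.size**2
        dist[:, c] = self_term - cross_term + cluster_term
    return np.argmin(dist, axis=1)

# Illustrative usage: two concentric rings, which plain k-means cannot separate
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.normal(size=200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
K = rbf_kernel_matrix(X, X, gamma=2.0)
labels = rng.integers(0, 2, size=200)   # random initial assignment
for _ in range(20):                     # repeat assignments until they stabilize
    labels = kernel_kmeans_assign(K, labels, k=2)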
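Similarly, the kernel PCA embedding described in the second item can be sketched in a few lines. The helper below assumes that the kernel similarity matrix S has already been computed; the RBF kernel used in the usage example is again an illustrative choice.

import numpy as np

def kernel_pca_embedding(S, k):
    # S: n x n kernel similarity matrix; returns the n x k embedding A = Q_k Sigma_k
    n = S.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n           # centering matrix (I - U/n)
    S_centered = C @ S @ C                        # mean-center in the transformed space
    eigvals, eigvecs = np.linalg.eigh(S_centered) # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]           # indices of the largest k eigenvalues
    Q_k = eigvecs[:, top]
    sigma_k = np.sqrt(np.maximum(eigvals[top], 0))  # square roots of the eigenvalues
    return Q_k * sigma_k                          # rows are the k-dimensional embeddings

# Illustrative usage with an RBF kernel matrix S = [K(Xi, Xj)]
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sq = np.sum(X**2, axis=1)
S = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
A = kernel_pca_embedding(S, k=2)                  # 50 x 2 nonlinear embedding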


10.7 Neural Networks


Neural networks are a simulation model of the human nervous system. The human nervous system is composed of cells, referred to as neurons. Biological neurons are connected to one another at contact points, which are referred to as synapses. Learning is performed in living organisms by changing the strength of synaptic connections between neurons. Typically, the strength of these connections changes in response to external stimuli. Neural networks can be considered a simulation of this biological process.


As in the case of biological networks, the individual nodes in artificial neural networks are referred to as neurons. These neurons are units of computation that receive input from some other neurons, perform computations on these inputs, and feed the results into yet other neurons. The computation function at a neuron is defined by the weights on the input connections to that neuron. These weights can be viewed as analogous to the strengths of synaptic connections. By changing the weights appropriately, the computation function can be learned, which is analogous to the learning of synaptic strengths in biological neural networks. The “external stimulus” in artificial neural networks for learning these weights is provided by the training data. The idea is to incrementally modify the weights whenever incorrect predictions are made by the current set of weights.


The key to the effectiveness of the neural network is the architecture used to arrange the connections among nodes. A wide variety of architectures exist, starting from a simple single-layer perceptron to complex multilayer networks.


10.7.1 Single-Layer Neural Network: The Perceptron


The most basic architecture of a neural network is referred to as the perceptron. An example of the perceptron architecture is illustrated in Fig. 10.10a. The perceptron contains two layers of nodes, which correspond to the input nodes, and a single output node. The number of input nodes is exactly equal to the dimensionality d of the underlying data. Each input node receives and transmits a single numerical attribute to the output node. Therefore, the input nodes only transmit input values and do not perform any computation on these values. In the basic perceptron model, the output node is the only node that performs a mathematical function on its inputs. The individual features in the training data are assumed to be numerical. Categorical attributes are handled by creating a separate binary input for each value of the categorical attribute. This is logically equivalent to binarizing the categorical attribute into multiple attributes. For simplicity of further discussion, it will be assumed that all input variables are numerical. Furthermore, it will be assumed that the classification problem contains two possible values for the class label, drawn from {−1, +1}.





  • Refer to Sect. 19.3.4 of Chap. 19. The small eigenvectors of the symmetric Laplacian are the same as the large eigenvectors of S = Λ^{-1/2} W Λ^{-1/2}. Here, W is often defined by the sparsified heat-kernel similarity between data points, and the factors involving Λ^{-1/2} provide local normalization of the similarity values to handle clusters of varying density.


As discussed earlier, each input node is connected by a weighted connection to the output node. These weights define a function from the values transmitted by the input nodes to a binary value drawn from {−1, +1}, which can be interpreted as the perceptron’s prediction of the class variable of the test instance fed to the input nodes. Just as learning is performed in biological systems by modifying synaptic strengths, the learning in a perceptron is performed by modifying the weights of the links connecting the input nodes to the output node whenever the predicted label does not match the true label.


The function learned by the perceptron is referred to as the activation function, which is a signed linear function. This function is very similar to that learned in SVMs for mapping training instances to binary class labels. Let W = (w_1 . . . w_d) be the weights for the connections of the d different inputs to the output neuron for a data record of dimensionality d. In addition, a bias b is associated with the activation function. The output z_i ∈ {−1, +1} for the feature set (x_i^1 . . . x_i^d) of the ith data record X_i is as follows:

\[ z_i = \mathrm{sign}\Big\{ \sum_{j=1}^{d} w_j x_i^j + b \Big\} \qquad (10.67) \]
\[ \phantom{z_i} = \mathrm{sign}\{ W \cdot X_i + b \} \qquad (10.68) \]




The value zi represents the prediction of the perceptron for the class variable of Xi. It is, therefore, desired to learn the weights so that the value of zi is equal to yi for as many training instances as possible. The error in prediction (zi − yi) may take on any of the values −2, 0, or +2. A value of 0 is attained when the predicted class is correct. The goal in neural network algorithms is to learn the vector of weights W and the bias b, so that zi approximates the true class variable yi as closely as possible.

The basic perceptron algorithm starts with a random vector of weights. The algorithm then feeds the input data items Xi into the neural network one by one to create the prediction zi. The weights are then updated based on the error value (zi − yi). Specifically, when the data point Xi is fed into the network in the tth iteration, the weight vector W^t is updated as follows:

\[ W^{t+1} = W^{t} + \eta \,(y_i - z_i)\, X_i . \qquad (10.69) \]




The parameter η regulates the learning rate of the neural network. The perceptron algorithm repeatedly cycles through all the training examples in the data and iteratively adjusts the weights until convergence is reached. The basic perceptron algorithm is illustrated in Fig. 10.9. Note that a single training data point may be cycled through many times. Each such cycle is referred to as an epoch.

Let us examine the incremental term (yi − zi)Xi in the update of Eq. 10.69, without the multiplicative factor η. It can be shown that this term is a heuristic approximation of the negative of the gradient of the least-squares prediction error (yi − zi)^2 = (yi − sign(W · Xi + b))^2 of the class variable, with respect to the vector of weights W. The update in this case is performed on a tuple-by-tuple basis, rather than globally over the entire data set, as one would expect in a global least-squares optimization. Nevertheless, the basic perceptron algorithm can be considered a modified version of the gradient-descent method, which implicitly minimizes the squared error of prediction. It is easy to see that nonzero updates are made to the weights only when errors are made in categorization. This is because the incremental term in Eq. 10.69 will be 0 whenever the predicted value zi is the same as the class label yi.
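As a worked sketch of this argument (a reconstruction, using the smoothing convention described in the footnote below, in which the derivative of the sign function is replaced by the derivative of its argument):

\[ E_i = (y_i - z_i)^2 = \big(y_i - \mathrm{sign}\{ W \cdot X_i + b \}\big)^2 \]
\[ \frac{\partial E_i}{\partial W} \approx -2\,(y_i - z_i)\, X_i \]
\[ W \Leftarrow W - \eta' \, \frac{\partial E_i}{\partial W} = W + 2\,\eta'\,(y_i - z_i)\, X_i , \]

which coincides with the update of Eq. 10.69 once the constant factor of 2 is absorbed into the learning rate η.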








  • In the gradient computation, the derivative of the sign function is replaced by the derivative of its argument. The derivative of the sign function is zero everywhere, except at zero, where it is indeterminate.


Algorithm Perceptron(Training Data: D)
begin
  Initialize weight vector W to random values;
  repeat
    Receive next training tuple (Xi, yi);
    zi = sign{W · Xi + b};
    W = W + η(yi − zi)Xi;
  until convergence;
end

Figure 10.9: The perceptron algorithm
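As an illustration, a minimal Python rendering of the loop in Figure 10.9 might look as follows. The synthetic data, the fixed learning rate, the epoch-based stopping rule, and the analogous update of the bias b are illustrative assumptions rather than part of the pseudocode above.

import numpy as np

def perceptron_train(X, y, eta=0.1, max_epochs=100, seed=0):
    # X: n x d feature matrix; y: class labels drawn from {-1, +1}
    rng = np.random.default_rng(seed)
    W = rng.normal(size=X.shape[1])                 # random initial weight vector
    b = 0.0                                         # bias of the activation function
    for _ in range(max_epochs):                     # each pass over the data is an epoch
        errors = 0
        for Xi, yi in zip(X, y):
            zi = 1.0 if W @ Xi + b >= 0 else -1.0   # zi = sign{W . Xi + b}, ties broken as +1
            if zi != yi:                            # Eq. 10.69: update only on misclassification
                W = W + eta * (yi - zi) * Xi
                b = b + eta * (yi - zi)             # bias updated analogously (an assumption)
                errors += 1
        if errors == 0:                             # convergence: an epoch with no errors
            break
    return W, b

# Illustrative usage on linearly separable data
rng = np.random.default_rng(1)
X = np.r_[rng.normal(loc=+2.0, size=(50, 2)), rng.normal(loc=-2.0, size=(50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]
W, b = perceptron_train(X, y)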




