Data Mining: The Textbook




a natural route for using SVMs in complex data types. This is because kernel methods only need the pairwise similarity between objects, and are agnostic to the feature values of the data points. Kernel functions have been defined for text, images, sequences, and graphs.


10.6.4.1 Other Applications of Kernel Methods


The use of kernel methods is not restricted to SVMs. These methods can be extended to any technique in which the solutions are directly or indirectly expressed in terms of dot products. Examples include Fisher's discriminant, logistic regression, linear regression (cf. Sect. 11.5.4 of Chap. 11), dimensionality reduction, and k-means clustering.





  1. Kernel k-means: The key idea is that the Euclidean distance between a data point X and the cluster centroid μ of cluster C can be expressed as a function of dot products involving X and the data points in C:

\[ \|X - \mu\|^2 = \Big\| X - \frac{\sum_{X_i \in C} X_i}{|C|} \Big\|^2 = X \cdot X \;-\; \frac{2 \sum_{X_i \in C} X \cdot X_i}{|C|} \;+\; \frac{\sum_{X_i \in C} \sum_{X_j \in C} X_i \cdot X_j}{|C|^2} \qquad (10.66) \]
In kernel k-means, the dot products Xi · Xj are replaced with kernel similarity values K(Xi, Xj). For the data point X, the index of its assigned cluster is obtained by selecting the minimum value of the (kernel-based) distance in Eq. 10.66 over all clusters. Note that the cluster centroids in the transformed space do not need to be explicitly maintained over the different k-means iterations, although the cluster assignment indices for each data point need to be maintained for computation of Eq. 10.66. Because of its implicit nonlinear transformation approach, kernel k-means is able to discover arbitrarily shaped clusters like spectral clustering, in spite of its use of the spherically biased Euclidean distance. A code sketch of this kernel-based assignment step is provided after this list.





  2. Kernel PCA: In conventional SVD and PCA of an n × d mean-centered data matrix D, the basis vectors are given by the eigenvectors of D^T D (the columnwise dot product matrix), and the coordinates of the transformed points are extracted from the scaled eigenvectors of DD^T (the rowwise dot product matrix). While the basis vectors can no longer be derived in kernel PCA, the coordinates of the transformed data can be extracted. The rowwise dot product matrix DD^T can be replaced with the kernel similarity matrix S = [K(Xi, Xj)]_{n×n}. The similarity matrix is then adjusted for mean-centering of the data in the transformed space as S ⇐ (I − U/n) S (I − U/n), where U is an n × n matrix containing all 1s (see Exercise 17). The assumption is that the matrix S can be approximately expressed in terms of dot products of the reduced data points in some k-dimensional transformed space. Therefore, one needs to approximately factorize S into the form AA^T to extract its reduced n × k embedding A in the transformed space. This is achieved by eigendecomposition. Let Q_k be the n × k matrix containing the largest k eigenvectors of S, and Σ_k be the k × k diagonal matrix containing the square roots of the corresponding eigenvalues. Then, it is evident that S ≈ Q_k Σ_k^2 Q_k^T = (Q_k Σ_k)(Q_k Σ_k)^T, and the k-dimensional embeddings of the data points are given by the rows of the n × k matrix A = Q_k Σ_k. Note that this is a truncated version of the data-specific Mercer kernel map. This nonlinear embedding is similar to that obtained




  • The original result [450] uses a more general argument to derive S′ Q_k Σ_k^{-1} as the m × k matrix of k-dimensional embedded coordinates of any out-of-sample m × d matrix D′. Here, S′ = D′D^T is the m × n matrix of kernel similarities between out-of-sample points in D′ and in-sample points in D. However, when D′ = D, this expression is (more simply) equivalent to Q_k Σ_k by expanding S′ = S ≈ Q_k Σ_k^2 Q_k^T.



by ISOMAP; however, unlike ISOMAP, out-of-sample points can also be transformed to the new space. It is noteworthy that the embedding of spectral clustering is also expressed in terms of the large eigenvectors of a sparsified similarity matrix, which is better suited to preserving local similarities for clustering. In fact, most forms of nonlinear embeddings can be shown to be large eigenvectors of similarity matrices (cf. Table 2.3 of Chap. 2), and are therefore special cases of kernel PCA. A code sketch of the kernel PCA embedding computation is also provided after this list.
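As referenced in the kernel k-means discussion above, the following Python sketch illustrates the assignment step of Eq. 10.66 with the dot products replaced by kernel values. The RBF kernel, the synthetic two-ring data set, and the fixed number of iterations are illustrative assumptions rather than part of the original text.

import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2), an illustrative kernel choice
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_kmeans_assign(K, labels, k):
    # K: n x n kernel similarity matrix; labels: current cluster indices; k: number of clusters
    n = K.shape[0]
    dist = np.full((n, k), np.inf)
    for c in range(k):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        # Eq. 10.66 with dot products replaced by kernel values:
        # ||X - mu_c||^2 = K(X, X) - 2 * sum_i K(X, Xi)/|C| + sum_{i,j} K(Xi, Xj)/|C|^2
        self_term = np.diag(K)
        cross_term = 2.0 * K[:, members].sum(axis=1) / members.size
        cluster_term = K[np.ix_(members, members)].sum() / members.size**2
        dist[:, c] = self_term - cross_term + cluster_term
    return np.argmin(dist, axis=1)

# Illustrative usage: two concentric rings, which plain k-means cannot separate
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.normal(size=200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
K = rbf_kernel_matrix(X, X, gamma=2.0)
labels = rng.integers(0, 2, size=200)   # random initial assignment
for _ in range(20):                     # repeat assignments until they stabilize
    labels = kernel_kmeans_assign(K, labels, k=2)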
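Similarly, the kernel PCA embedding described in the second item can be sketched in a few lines. The helper below assumes that the kernel similarity matrix S has already been computed; the RBF kernel used in the usage example is again an illustrative choice.

import numpy as np

def kernel_pca_embedding(S, k):
    # S: n x n kernel similarity matrix; returns the n x k embedding A = Q_k Sigma_k
    n = S.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n           # centering matrix (I - U/n)
    S_centered = C @ S @ C                        # mean-center in the transformed space
    eigvals, eigvecs = np.linalg.eigh(S_centered) # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]           # indices of the largest k eigenvalues
    Q_k = eigvecs[:, top]
    sigma_k = np.sqrt(np.maximum(eigvals[top], 0))  # square roots of the eigenvalues
    return Q_k * sigma_k                          # rows are the k-dimensional embeddings

# Illustrative usage with an RBF kernel matrix S = [K(Xi, Xj)]
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sq = np.sum(X**2, axis=1)
S = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
A = kernel_pca_embedding(S, k=2)                  # 50 x 2 nonlinear embedding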


10.7 Neural Networks


Neural networks are a simulation model of the human nervous system. The human nervous system is composed of cells, referred to as neurons. Biological neurons are connected to one another at contact points, which are referred to as synapses. Learning is performed in living organisms by changing the strength of synaptic connections between neurons. Typically, the strength of these connections changes in response to external stimuli. Neural networks can be considered a simulation of this biological process.


As in the case of biological networks, the individual nodes in artificial neural networks are referred to as neurons. These neurons are units of computation that receive input from some other neurons, perform computations on these inputs, and feed the results into yet other neurons. The computation function at a neuron is defined by the weights on the input connections to that neuron. These weights can be viewed as analogous to the strengths of synaptic connections. By changing the weights appropriately, the computation function can be learned, which is analogous to the learning of synaptic strengths in biological neural networks. The “external stimulus” in artificial neural networks for learning these weights is provided by the training data. The idea is to incrementally modify the weights whenever incorrect predictions are made by the current set of weights.


The key to the effectiveness of the neural network is the architecture used to arrange the connections among nodes. A wide variety of architectures exist, starting from a simple single-layer perceptron to complex multilayer networks.


10.7.1 Single-Layer Neural Network: The Perceptron


The most basic architecture of a neural network is referred to as the perceptron. An example of the perceptron architecture is illustrated in Fig. 10.10a. The perceptron contains two layers of nodes, which correspond to the input nodes, and a single output node. The number of input nodes is exactly equal to the dimensionality d of the underlying data. Each input node receives and transmits a single numerical attribute to the output node. Therefore, the input nodes only transmit input values and do not perform any computation on these values. In the basic perceptron model, the output node is the only node that performs a mathematical function on its inputs. The individual features in the training data are assumed to be numerical. Categorical attributes are handled by creating a separate binary input for each value of the categorical attribute. This is logically equivalent to binarizing the categorical attribute into multiple attributes. For simplicity of further discussion, it will be assumed that all input variables are numerical. Furthermore, it will be assumed that the classification problem contains two possible values for the class label, drawn from {−1, +1}.





  • Refer to Sect. 19.3.4 of Chap. 19. The small eigenvectors of the symmetric Laplacian are the same as the large eigenvectors of S = Λ^{-1/2} W Λ^{-1/2}. Here, W is often defined by the sparsified heat-kernel similarity between data points, and the factors involving Λ^{-1/2} provide local normalization of the similarity values to handle clusters of varying density.


As discussed earlier, each input node is connected by a weighted connection to the output node. These weights define a function from the values transmitted by the input nodes to a binary value drawn from {−1, +1}, which can be interpreted as the perceptron’s prediction of the class variable of the test instance fed to the input nodes. Just as learning is performed in biological systems by modifying synaptic strengths, the learning in a perceptron is performed by modifying the weights of the links connecting the input nodes to the output node whenever the predicted label does not match the true label.


The function learned by the perceptron is referred to as the activation function, which is a signed linear function. This function is very similar to that learned in SVMs for mapping training instances to binary class labels. Let W = (w_1 . . . w_d) be the weights for the connections of the d different inputs to the output neuron for a data record of dimensionality d. In addition, a bias b is associated with the activation function. The output z_i ∈ {−1, +1} for the feature set (x_i^1 . . . x_i^d) of the ith data record X_i is as follows:

\[ z_i = \mathrm{sign}\Big\{ \sum_{j=1}^{d} w_j x_i^j + b \Big\} \qquad (10.67) \]
\[ \phantom{z_i} = \mathrm{sign}\{ W \cdot X_i + b \} \qquad (10.68) \]




The value zi represents the prediction of the perceptron for the class variable of Xi. It is, therefore, desired to learn the weights so that the value of zi is equal to yi for as many training instances as possible. The error in prediction (zi − yi) may take on any of the values −2, 0, or +2. A value of 0 is attained when the predicted class is correct. The goal in neural network algorithms is to learn the vector of weights W and the bias b, so that zi approximates the true class variable yi as closely as possible.

The basic perceptron algorithm starts with a random vector of weights. The algorithm then feeds the input data items Xi into the neural network one by one to create the prediction zi. The weights are then updated based on the error value (zi − yi). Specifically, when the data point Xi is fed into the network in the tth iteration, the weight vector W^t is updated as follows:

\[ W^{t+1} = W^{t} + \eta \,(y_i - z_i)\, X_i . \qquad (10.69) \]




The parameter η regulates the learning rate of the neural network. The perceptron algorithm repeatedly cycles through all the training examples in the data and iteratively adjusts the weights until convergence is reached. The basic perceptron algorithm is illustrated in Fig. 10.9. Note that a single training data point may be cycled through many times. Each such cycle is referred to as an epoch.

Let us examine the incremental term (yi − zi)Xi in the update of Eq. 10.69, without the multiplicative factor η. It can be shown that this term is a heuristic approximation of the negative of the gradient of the least-squares prediction error (yi − zi)^2 = (yi − sign(W · Xi + b))^2 of the class variable, with respect to the vector of weights W. The update in this case is performed on a tuple-by-tuple basis, rather than globally over the entire data set, as one would expect in a global least-squares optimization. Nevertheless, the basic perceptron algorithm can be considered a modified version of the gradient-descent method, which implicitly minimizes the squared error of prediction. It is easy to see that nonzero updates are made to the weights only when errors are made in categorization. This is because the incremental term in Eq. 10.69 will be 0 whenever the predicted value zi is the same as the class label yi.
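As a worked sketch of this argument (a reconstruction, using the smoothing convention described in the footnote below, in which the derivative of the sign function is replaced by the derivative of its argument):

\[ E_i = (y_i - z_i)^2 = \big(y_i - \mathrm{sign}\{ W \cdot X_i + b \}\big)^2 \]
\[ \frac{\partial E_i}{\partial W} \approx -2\,(y_i - z_i)\, X_i \]
\[ W \Leftarrow W - \eta' \, \frac{\partial E_i}{\partial W} = W + 2\,\eta'\,(y_i - z_i)\, X_i , \]

which coincides with the update of Eq. 10.69 once the constant factor of 2 is absorbed into the learning rate η.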








  • In the gradient computation, the derivative of the sign function is replaced by the derivative of its argument. The derivative of the sign function is zero everywhere, except at zero, where it is indeterminate.


Algorithm Perceptron(Training Data: D)
begin
  Initialize weight vector W to random values;
  repeat
    Receive next training tuple (Xi, yi);
    zi = sign{W · Xi + b};
    W = W + η(yi − zi)Xi;
  until convergence;
end

Figure 10.9: The perceptron algorithm
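As an illustration, a minimal Python rendering of the loop in Figure 10.9 might look as follows. The synthetic data, the fixed learning rate, the epoch-based stopping rule, and the analogous update of the bias b are illustrative assumptions rather than part of the pseudocode above.

import numpy as np

def perceptron_train(X, y, eta=0.1, max_epochs=100, seed=0):
    # X: n x d feature matrix; y: class labels drawn from {-1, +1}
    rng = np.random.default_rng(seed)
    W = rng.normal(size=X.shape[1])                 # random initial weight vector
    b = 0.0                                         # bias of the activation function
    for _ in range(max_epochs):                     # each pass over the data is an epoch
        errors = 0
        for Xi, yi in zip(X, y):
            zi = 1.0 if W @ Xi + b >= 0 else -1.0   # zi = sign{W . Xi + b}, ties broken as +1
            if zi != yi:                            # Eq. 10.69: update only on misclassification
                W = W + eta * (yi - zi) * Xi
                b = b + eta * (yi - zi)             # bias updated analogously (an assumption)
                errors += 1
        if errors == 0:                             # convergence: an epoch with no errors
            break
    return W, b

# Illustrative usage on linearly separable data
rng = np.random.default_rng(1)
X = np.r_[rng.normal(loc=+2.0, size=(50, 2)), rng.normal(loc=-2.0, size=(50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]
W, b = perceptron_train(X, y)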




