One can change the objective function to be optimized. For example, PLSA (cf. Sect. 13.4 of Chap. 13) interprets the non-negative elements of the (scaled) matrix as probabilities and maximizes the likelihood estimate of a generative model with respect to the observed matrix elements.

The different variations of matrix factorization provide different types of utility in various applications. The latent factors in NMF are more easily interpretable for clustering applications because of non-negativity. For example, in application domains such as text clustering, each of the k columns in U and V can be associated with document clusters and word clusters, respectively. The magnitudes of the non-negative (transformed) coordinates reflect which concepts are strongly expressed in a document. This "additive parts" representation of NMF is highly interpretable, especially in domains, such as text, in which the features have semantic meaning. This is not possible with SVD, in which transformed coordinate values and basis vector components may be negative. This is also the reason that NMF transformations are more useful than those of SVD for clustering. Similarly, the probabilistic forms of non-negative matrix factorization, such as PLSA, are also commonly used for clustering. It is instructive to compare the example of Fig. 6.22 with the SVD of the same matrix at the end of Sect. 2.4.3.2 in Chap. 2. Note that the NMF factorization is more easily interpretable.

Unlike SVD, the k latent factors of NMF are not orthogonal to one another. This is a disadvantage of NMF because orthogonality of the axis system allows intuitive interpretations of the data transformation as an axis rotation. It is also easy to project out-of-sample data points (i.e., data points not included in D) on an orthonormal basis system, and distance computations between transformed data points are more meaningful in SVD. The addition of a constraint, such as non-negativity, to an optimization problem usually reduces the quality of the solution found. However, the addition of orthogonality constraints, as in SVD, does not affect the theoretical global optimum of the unconstrained matrix factorization formulation (see Exercise 13). Therefore, SVD provides better rank-k approximations than NMF. Furthermore, it is much easier in practice to determine the global optimum of SVD than that of unconstrained matrix factorization for matrices that are completely specified. Thus, SVD provides one of the alternate global optima of unconstrained matrix factorization, and one that is computationally easy to determine.

On the other hand, SVD is generally harder to implement for incomplete data matrices than many other variations of matrix factorization. This is relevant in recommender systems, where rating matrices are incomplete. The use of latent factor models for recommendations is discussed in Sect. 18.5.5 of Chap. 18. Thus, SVD and NMF have different advantages and disadvantages and may be more suitable for different applications. A small numerical sketch contrasting the two factorizations is shown below.
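The contrast can be made concrete in a few lines of code. The following is a minimal sketch, not taken from the textbook: it factorizes a small synthetic document-term matrix (an illustrative stand-in for the example of Fig. 6.22) with both NMF and truncated SVD, compares the rank-k reconstruction errors, and shows how easily an out-of-sample point is projected onto the orthonormal SVD basis. The matrix, the choice k = 2, and the use of scikit-learn and numpy are assumptions for illustration.

```python
# Sketch: NMF vs. truncated SVD on a small non-negative matrix.
# The 6 x 6 "document-term" matrix below is illustrative, not the book's example.
import numpy as np
from sklearn.decomposition import NMF

# Rows = documents, columns = words; two latent "topics" are visible.
D = np.array([
    [2, 2, 1, 0, 0, 0],
    [2, 3, 3, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 2, 3],
    [0, 0, 0, 3, 3, 2],
], dtype=float)

k = 2

# NMF: D ~ U V^T with U, V >= 0; columns of U and V map to
# document clusters and word clusters, respectively.
nmf = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
U = nmf.fit_transform(D)        # n x k non-negative document factors
V = nmf.components_.T           # d x k non-negative word factors
nmf_err = np.linalg.norm(D - U @ V.T, ord="fro")

# Truncated SVD: the optimal rank-k approximation in Frobenius norm.
P, s, Qt = np.linalg.svd(D, full_matrices=False)
D_svd = P[:, :k] @ np.diag(s[:k]) @ Qt[:k, :]
svd_err = np.linalg.norm(D - D_svd, ord="fro")

print(f"NMF rank-{k} error: {nmf_err:.4f}")
print(f"SVD rank-{k} error: {svd_err:.4f}")  # never larger than the NMF error
print("Non-negative document factors U:\n", np.round(U, 2))

# Out-of-sample projection is straightforward with the orthonormal SVD basis:
x_new = np.array([1.0, 2.0, 1.0, 0.0, 0.0, 0.0])
x_proj = x_new @ Qt[:k, :].T    # k-dimensional reduced representation
print("Projected out-of-sample point:", np.round(x_proj, 2))
```

Running this, the SVD error is no larger than the NMF error (a consequence of SVD's global optimality for unconstrained rank-k approximation), while the non-negative entries of U cleanly separate the two document clusters.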
6.9 Cluster Validation

After a clustering of the data has been determined, it is important to evaluate its quality. This problem is referred to as cluster validation. Cluster validation is often difficult in real data sets because the problem is defined in an unsupervised way. Therefore, no external validation criteria may be available to evaluate a clustering. Thus, a number of internal criteria may be defined to validate the quality of a clustering. The major problem with internal criteria is that they may be biased toward one algorithm or the other, depending on how they are defined. In some cases, external validation criteria may be available when a test data set is synthetically generated, and therefore the true (ground-truth) clusters are known. Alternatively, for real data sets, the class labels, if available, may be used as proxies for the cluster identifiers. In such cases, the evaluation is more effective. Such criteria are referred to as external validation criteria.

6.9.1 Internal Validation Criteria

Internal validation criteria are used when no external criteria are available to evaluate the quality of a clustering. In most cases, the criteria used to validate the quality of the algorithm are borrowed directly from the objective function optimized by a particular clustering model. For example, virtually any of the objective functions in the k-representatives, EM, and agglomerative methods could be used for validation purposes. The problem with the use of these criteria becomes obvious when comparing algorithms with disparate methodologies: a validation criterion will always favor a clustering algorithm that uses a similar kind of objective function for its optimization. Nevertheless, in the absence of external validation criteria, this is the best that one can hope to achieve. Such criteria can also be effective in comparing two algorithms that use the same broad approach. The commonly used internal evaluation criteria are as follows:
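One of the most widely used of these criteria is the sum of squared distances of data points to their closest cluster representatives (SSE). The sketch below is illustrative rather than the textbook's own formulation; it assumes numpy and scikit-learn, and uses a k-means clustering of synthetic two-blob data to show both how SSE is computed and the bias discussed above: SSE naturally favors k-representative methods, and it decreases monotonically with the number of clusters.

```python
# Sketch (illustrative, not the textbook's code): the SSE internal criterion.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian blobs in two dimensions.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return float(sum(np.sum((X[labels == j] - c) ** 2)
                     for j, c in enumerate(centroids)))

for k in (1, 2, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # For k-means, this value equals km.inertia_.
    print(f"k={k}: SSE = {sse(X, km.labels_, km.cluster_centers_):.2f}")

# SSE always decreases as k grows, so it cannot be compared across different
# numbers of clusters without care (e.g., inspecting an elbow plot).
```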