Data Mining: The Textbook
about the normal patterns of users. This can be used to reorganize the site or optimize its structure. In some cases, such information about normal patterns can be used to discover anomalies that do not conform to the normal patterns of interaction. A related domain is that of biological sequence data, where clusters of sequences are related to their underlying biological properties.

7.8.6 Social Network Analysis

Clustering methods are useful for finding related communities of users in social-networking Web sites. This problem is known as community detection. Community detection has a wide variety of other applications in network science, such as anomaly detection, classification, influence analysis, and link prediction. These applications are discussed in detail in Chap. 19 on social network analysis.

7.9 Summary

This chapter discusses a number of advanced scenarios for cluster analysis. These scenarios include the clustering of advanced data types such as categorical data, large-scale data, and high-dimensional data. Many traditional clustering algorithms can be modified to work with categorical data by making changes to specific criteria, such as the similarity function or mixture model. Scalable algorithms require a change in algorithm design to reduce the number of passes over the data. High-dimensional data is the most difficult case because of the presence of many irrelevant features in the underlying data. Because clustering algorithms yield many alternative solutions, supervision can help guide the cluster discovery process. This supervision can either be in the form of background knowledge or user interaction. In some cases, the alternative clusterings can be combined to create a consensus clustering that is more robust than the solution obtained from a single model.
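As an illustration of adapting a partitional algorithm to categorical data by changing the similarity criterion, the sketch below implements the basic k-modes idea: dissimilarity is the number of mismatched attribute values, and each cluster is represented attribute-wise by its most frequent value (the mode). This is a minimal sketch under simple assumptions, not the exact formulation of the published algorithm; the function name, the `init` parameter for fixing initial modes, and the convergence test are all illustrative choices.

```python
import numpy as np

def kmodes(X, k, n_iter=20, init=None, seed=0):
    """Minimal k-modes sketch for categorical data (illustrative):
    mismatch-count dissimilarity, attribute-wise modes as representatives."""
    X = np.asarray(X, dtype=object)
    n, d = X.shape
    if init is None:
        # Draw the initial modes from the data points themselves.
        idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    else:
        idx = np.asarray(init)
    modes = X[idx].copy()
    labels = np.zeros(n, dtype=int)
    for it in range(n_iter):
        # Assign each point to the mode with the fewest attribute mismatches.
        dist = np.array([[int(np.sum(x != m)) for m in modes] for x in X])
        new_labels = dist.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments unchanged: converged
        labels = new_labels
        # Recompute each mode attribute-wise as the most frequent value.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old mode for an empty cluster
            for a in range(d):
                vals, counts = np.unique(members[:, a], return_counts=True)
                modes[j, a] = vals[counts.argmax()]
    return labels, modes
```

Applied to the categorical attributes of a labeled data set, the resulting labels can then be compared against class labels to compute cluster purity, as in the exercises below.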
7.10 Bibliographic Notes

The problem of clustering categorical data is closely related to that of finding suitable similarity measures [104, 182], because many clustering algorithms use similarity measures as a subroutine. The k-modes algorithm and a fuzzy version of it may be found in [135, 278]. Popular categorical clustering algorithms include ROCK [238], CACTUS [220], LIMBO [75], and STIRR [229]. The three scalable clustering algorithms discussed in this book are CLARANS [407], BIRCH [549], and CURE [239]. The high-dimensional clustering algorithms discussed in this chapter include CLIQUE [58], PROCLUS [19], and ORCLUS [22]. Detailed surveys on many different types of categorical, scalable, and high-dimensional clustering algorithms may be found in [32].

Methods for semisupervised clustering with the use of seeding, constraints, metric learning, probabilistic learning, and graph-based learning are discussed in [80, 81, 94, 329]. The IPCLUS method discussed in this chapter was first proposed in [43]. Two other tools that are able to discover clusters by visualizing lower-dimensional subspaces include HD-Eye [268] and RNavGraph [502]. The cluster ensemble framework was first proposed in [479]. The hypergraph partitioning algorithm HMETIS, which is used in ensemble clustering, was proposed in [302]. Subsequently, the utility of the method has also been demonstrated for high-dimensional data [205].

236 CHAPTER 7. CLUSTER ANALYSIS: ADVANCED CONCEPTS

7.11 Exercises

1. Implement the k-modes algorithm. Download the KDD CUP 1999 Network Intrusion Data Set [213] from the UCI Machine Learning Repository, and apply the algorithm to the categorical attributes of the data set. Compute the cluster purity with respect to class labels.

2. Repeat the previous exercise with an implementation of the ROCK algorithm.

3. What changes would be required to the BIRCH algorithm to implement it with the use of the Mahalanobis distance to compute distances between data points and centroids?
The diameter of a cluster is computed as its RMS Mahalanobis radius.

4. Discuss the connection between high-dimensional clustering algorithms, such as PROCLUS and ORCLUS, and wrapper models for feature selection.

5. Show how to create an implementation of the cluster feature vector that allows the incremental computation of the covariance matrix of the cluster. Use this to create an incremental and scalable version of the Mahalanobis k-means algorithm.

6. Implement the k-means algorithm, with an option of selecting any of the points from the original data as seeds. Apply the approach to the quantitative attributes of the data set in Exercise 1, and select one data point from each class as a seed. Compare the cluster purity with that of an implementation that uses random seeds.

7. Describe an automated way to determine whether a set of “must-link” and “cannot-link” constraints is consistent.

Chapter 8

Outlier Analysis

“You are unique, and if that is not fulfilled, then something has been lost.”—Martha Graham

8.1 Introduction

An outlier is a data point that is very different from most of the remaining data. Hawkins formally defined the notion of an outlier as follows:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

Outliers can be viewed as a complementary concept to that of clusters. While clustering attempts to determine groups of data points that are similar, outliers are individual data points that are different from the remaining data. Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. Outliers have numerous applications in many data mining scenarios: