Difficult clustering scenarios: Some data clustering scenarios are especially challenging. These include the clustering of categorical data, high-dimensional data, and massive data. Discrete data are difficult to cluster because of the challenges in distance computation, and in appropriately defining a "central" cluster representative from a set of categorical data points. In the high-dimensional case, the presence of many irrelevant dimensions may obscure the clusters. Finally, massive data sets are more difficult to cluster because of scalability issues.
Advanced insights: Because the clustering problem is an unsupervised one, it is often difficult to evaluate the quality of the underlying clusters in a meaningful way. This weakness of cluster validity methods was discussed in the previous chapter. Many alternative clusterings may exist, and it may be difficult to evaluate their relative quality. There are many ways of improving application-specific relevance and robustness by using external supervision, human supervision, or meta-algorithms such as ensemble clustering that combine multiple clusterings of the data.
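To make the idea of combining multiple clusterings concrete, the following is a minimal sketch of one simple ensemble scheme based on pairwise co-association. It is an illustration under stated assumptions rather than a method prescribed by the text: the function names `co_association` and `consensus_clusters`, the threshold of 0.5, and the toy label vectors are all hypothetical. Each base clustering votes on whether two points belong together, and points that are grouped together in more than half of the base clusterings are linked in the consensus.

```python
def co_association(labelings):
    """Fraction of base clusterings that place each pair of points together."""
    n = len(labelings[0])
    m = len(labelings)
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Link pairs whose co-association exceeds the (assumed) threshold,
    then report the connected components as the consensus clusters."""
    co = co_association(labelings)
    n = len(co)
    parent = list(range(n))  # union-find forest over the points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Three hypothetical base clusterings of five points. Cluster identifiers
# are arbitrary within each run, so only co-membership is compared.
base = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
consensus = consensus_clusters(base)
```

Note that the raw labels of different runs are never compared directly, which is why the co-membership matrix is a convenient common representation. Other consensus functions are possible, and the choice of threshold here is arbitrary.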
The difficult clustering scenarios are typically caused by particular aspects of the data that make the analysis more challenging. These aspects are as follows:
C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8_7
© Springer International Publishing Switzerland 2015
CHAPTER 7. CLUSTER ANALYSIS: ADVANCED CONCEPTS
Categorical data clustering: Categorical data sets are more challenging for clustering because the notion of similarity is harder to define in such scenarios. Furthermore, many intermediate steps in clustering algorithms, such as the determination of the mean of a cluster, are not quite as naturally defined for categorical data as for numeric data.
Scalable clustering: Many clustering algorithms require multiple passes over the data. This can create a challenge when the data are very large and reside on disk.
High-dimensional clustering: As discussed in Sect. 3.2.1.2 of Chap. 3, the computation of similarity between high-dimensional data points often does not reflect the intrinsic distance because of many irrelevant attributes and concentration effects. Therefore, many methods have been designed that use projections to determine the clusters in relevant subsets of dimensions.
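For the categorical case in the list above, one common workaround (used here purely as an illustration, not as the approach the chapter goes on to develop) is to replace the Euclidean distance with a count of mismatched attributes and to replace the mean with the attribute-wise mode. The sketch below assumes records are tuples of category values; the names `mismatch_distance` and `cluster_mode` and the toy records are hypothetical.

```python
from collections import Counter


def mismatch_distance(x, y):
    """Number of attributes on which two categorical records differ."""
    return sum(a != b for a, b in zip(x, y))


def cluster_mode(records):
    """Attribute-wise most frequent value, a stand-in for the cluster mean.
    Ties are broken by whichever value appears first in the records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical cluster of three records with attributes (color, size, shape).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
representative = cluster_mode(cluster)

# Distance from a new record to the representative of the cluster.
d_far = mismatch_distance(("blue", "large", "square"), representative)
d_near = mismatch_distance(("red", "large", "round"), representative)
```

The mismatch count treats every attribute and every pair of distinct values as equally different, which is a strong simplification. It nevertheless shows why both the distance and the representative have to be redefined before a mean-based algorithm can be applied to categorical data.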
Because clustering is an unsupervised problem, the quality of the clusters may be difficult to evaluate in many real scenarios. Furthermore, when the data are noisy, the quality may also be poor. Therefore, a variety of methods are used to either supervise the clustering, or gain advanced insights from the clustering process. These methods are as follows:
Semisupervised clustering: In some cases, partial information may be available about the underlying clusters. This information may be available in the form of labels or other external feedback. Such information can be used to greatly improve the clustering quality.
Interactive and visual clustering: In these cases, feedback from the user may be utilized to improve the quality of the clustering. This feedback is typically obtained through visual interaction. For example, an interactive approach may explore the data in different subspace projections and isolate the most relevant clusters.
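As an illustration of how partial label information can guide a clustering, the following sketch shows a seeded variant of k-means. It is one possible way to use labels and is not claimed to be the specific method intended by the text. The function name `seeded_kmeans`, its parameters, and the toy points are hypothetical, and the sketch assumes that every cluster has at least one labeled point. The labeled points initialize the centroids and keep their labels throughout, while the unlabeled points are assigned to the nearest centroid as usual.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seeded_kmeans(points, seeds, iters=10):
    """points: list of tuples; seeds: dict mapping point index -> label in 0..k-1.
    Assumes every label 0..k-1 occurs at least once in seeds."""
    k = max(seeds.values()) + 1
    # Initialize each centroid from the labeled points of that cluster.
    centroids = [
        centroid([points[i] for i, c in seeds.items() if c == j]) for j in range(k)
    ]
    assign = []
    for _ in range(iters):
        assign = []
        for i, p in enumerate(points):
            if i in seeds:
                assign.append(seeds[i])  # labeled points keep their label
            else:
                assign.append(min(range(k), key=lambda j: sq_dist(p, centroids[j])))
        # Clusters are never empty because each one retains its seed points.
        centroids = [
            centroid([p for p, a in zip(points, assign) if a == j]) for j in range(k)
        ]
    return assign, centroids


# Six hypothetical 2-D points, of which only two carry labels.
data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels, centers = seeded_kmeans(data, seeds={0: 0, 3: 1})
```

Fixing the labels of the seed points is a design choice that treats the supervision as fully reliable. If the labels may be noisy, a variant that uses them only for initialization would be a reasonable alternative.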