Data Mining: The Textbook


Difficult clustering scenarios: Many data clustering scenarios are more challenging. These include the clustering of categorical data, high-dimensional data, and massive data. Discrete data are difficult to cluster because of the challenges in distance computation and in appropriately defining a "central" cluster representative from a set of categorical data points. In the high-dimensional case, many irrelevant dimensions may cause challenges for the clustering process. Finally, massive data sets are more difficult to cluster because of scalability issues.

Advanced insights: Because the clustering problem is an unsupervised one, it is often difficult to evaluate the quality of the underlying clusters in a meaningful way. This weakness of cluster validity methods was discussed in the previous chapter. Many alternative clusterings may exist, and it may be difficult to evaluate their relative quality. There are many ways of improving application-specific relevance and robustness by using external supervision, human supervision, or meta-algorithms such as ensemble clustering that combine multiple clusterings of the data.
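
The two difficulties named above for categorical data (computing distances and defining a "central" representative) can be made concrete with a small sketch. It assumes one common workaround rather than a method stated in this passage: distance is the fraction of mismatching attributes, and the representative is the per-attribute mode instead of a mean. The function names and the toy records are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Fraction of attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mode_representative(records):
    """Per-attribute most frequent value; ties go to the value seen first."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Toy cluster of categorical records (illustrative data only).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]

rep = mode_representative(cluster)
print(rep)
print([matching_distance(rep, record) for record in cluster])
```

Note that the representative built this way need not be the unique "center": when attribute values tie, several equally good representatives exist, which is one reason the categorical case is less natural than the numeric one.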

The difficult clustering scenarios are typically caused by particular aspects of the data that make the analysis more challenging. These aspects are as follows:

C. C. Aggarwal, Data Mining: The Textbook, Chapter 7 (Cluster Analysis: Advanced Concepts), DOI 10.1007/978-3-319-14142-8_7, © Springer International Publishing Switzerland 2015

Categorical data clustering: Categorical data sets are more challenging for clustering because the notion of similarity is harder to define in such scenarios. Furthermore, many intermediate steps in clustering algorithms, such as the determination of the mean of a cluster, are not quite as naturally defined for categorical data as for numeric data.

Scalable clustering: Many clustering algorithms require multiple passes over the data. This can create a challenge when the data are very large and reside on disk.

High-dimensional clustering: As discussed in Sect. 3.2.1.2 of Chap. 3, the computation of similarity between high-dimensional data points often does not reflect the intrinsic distance because of many irrelevant attributes and concentration effects. Therefore, many methods have been designed that use projections to determine the clusters in relevant subsets of dimensions.
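
The concentration effect mentioned in the last item can be observed with a small experiment. The sketch below is illustrative only and rests on assumptions that are not in the text: points are drawn uniformly at random from the unit cube, distances are Euclidean, and the spread of distances is summarized by a hypothetical helper `relative_contrast`, computed as (d_max - d_min) / d_min from one random query point.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(d_max - d_min) / d_min for distances from one random query point
    to n_points uniform random points in the unit cube."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers
contrast_by_dim = {d: relative_contrast(200, d, rng) for d in (2, 20, 200, 1000)}

for d, c in contrast_by_dim.items():
    print(f"{d:5d} dimensions: relative contrast = {c:.3f}")
```

As the number of dimensions grows, the nearest and farthest points end up at nearly the same distance from the query, so the contrast shrinks. This is the sense in which full-dimensional distances stop reflecting intrinsic structure, and it motivates searching for clusters in subsets of dimensions.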

Because clustering is an unsupervised problem, the quality of the clusters may be difficult to evaluate in many real scenarios. Furthermore, when the data are noisy, the quality may also be poor. Therefore, a variety of methods are used either to supervise the clustering or to gain advanced insights from the clustering process. These methods are as follows:

Semisupervised clustering: In some cases, partial information may be available about the underlying clusters. This information may be available in the form of labels or other external feedback. Such information can be used to greatly improve the clustering quality.

Interactive and visual clustering: In these cases, feedback from the user may be utilized to improve the quality of the clustering. This feedback is typically obtained with the help of visual interaction. For example, an interactive approach may explore the data in different subspace projections and isolate the most relevant clusters.
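
For the semisupervised case above, one simple way to use partial labels is to let the labeled points seed the initial centroids of a k-means-style procedure. The sketch below is a minimal illustration under that assumption and is not necessarily the approach developed later in the chapter; the function `seeded_kmeans`, its parameters, and the toy data are hypothetical.

```python
import math


def seeded_kmeans(points, seeds, n_iter=10):
    """k-means whose initial centroids are the means of labeled seed points.

    points: list of coordinate tuples (all data, labeled or not).
    seeds:  dict mapping a cluster label to a list of labeled points.
    """
    def mean(group):
        return tuple(sum(coords) / len(group) for coords in zip(*group))

    labels = sorted(seeds)
    centroids = {lab: mean(seeds[lab]) for lab in labels}
    assignment = []
    for _ in range(n_iter):
        # Assign every point to the label of its closest centroid.
        assignment = [
            min(labels, key=lambda lab: math.dist(p, centroids[lab]))
            for p in points
        ]
        # Recompute each centroid from its current members.
        for lab in labels:
            members = [p for p, a in zip(points, assignment) if a == lab]
            if members:  # keep the old centroid if a cluster empties out
                centroids[lab] = mean(members)
    return centroids, assignment


# Toy data: two well-separated groups, with one labeled point per group.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
seeds = {"a": [(0.0, 0.0)], "b": [(5.0, 5.0)]}

centroids, assignment = seeded_kmeans(points, seeds)
print(assignment)
print(centroids)
```

In this sketch the labels influence only the starting point; the labeled points may still be reassigned in later iterations. A stricter variant would hold their assignments fixed.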
