Data Mining: The Textbook
2. Define the Mahalanobis-based extreme value measure when the d dimensions are statistically independent of one another, in terms of the dimension-specific standard deviations σ1 . . . σd.

3. Consider the four 2-dimensional data points (0, 0), (0, 1), (1, 0), and (100, 100). Plot them using mathematical software such as MATLAB. Which data point visually seems like an extreme value? Which data point is reported by the Mahalanobis measure as the strongest extreme value? Which data points are reported by a depth-based measure?

4. Implement the EM algorithm for clustering, and use it to implement a computation of the probabilistic outlier scores.

5. Implement the Mahalanobis k-means algorithm, and use it to implement a computation of the outlier score in terms of the local Mahalanobis distance to the closest cluster centroid.

6. Discuss the connection between the algorithms implemented in Exercises 4 and 5.

7. Discuss the advantages and disadvantages of clustering models over distance-based models.

8. Implement a naive distance-based outlier detection algorithm with no pruning.

9. What is the effect of the parameter k in k-nearest neighbor outlier detection? When do small values of k work well, and when do larger values of k work well?

10. Design an outlier detection approach with the use of the NMF method of Chap. 6.

11. Discuss the relative effectiveness of pruning of distance-based algorithms in data sets that are (a) uniformly distributed, and (b) highly clustered with modest ambient noise and outliers.

12. Implement the LOF algorithm for outlier detection.

13. Consider the set of 1-dimensional data points {1, 2, 2, 2, 2, 2, 6, 8, 10, 12, 14}. What are the data point(s) with the highest outlier score for a distance-based algorithm, using k = 2? What are the data points with the highest outlier score using the LOF algorithm? Why the difference?

14. Implement the instance-specific Mahalanobis method for outlier detection.
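As background for the exercise on the Mahalanobis-based extreme value measure with statistically independent dimensions: a diagonal covariance matrix reduces the measure to a standardized Euclidean distance, score(X) = sqrt(Σ_i ((x_i − μ_i)/σ_i)²). A minimal Python sketch, applied to the four points from the plotting exercise (the function name and the use of the population standard deviation are illustrative assumptions, not from the text):

```python
import math

def independent_mahalanobis_scores(points):
    # With statistically independent dimensions, the covariance matrix is
    # diagonal, so the Mahalanobis measure reduces to a standardized
    # Euclidean distance: sqrt(sum_i ((x_i - mu_i) / sigma_i)^2).
    n, d = len(points), len(points[0])
    mu = [sum(p[i] for p in points) / n for i in range(d)]
    # Population standard deviation per dimension (an illustrative choice).
    sigma = [math.sqrt(sum((p[i] - mu[i]) ** 2 for p in points) / n)
             for i in range(d)]
    return [math.sqrt(sum(((p[i] - mu[i]) / sigma[i]) ** 2 for i in range(d)))
            for p in points]

points = [(0, 0), (0, 1), (1, 0), (100, 100)]
scores = independent_mahalanobis_scores(points)
```

On these four points, (100, 100) receives by far the largest score, which agrees with the visual intuition from the plot.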
15. Given a set of ground-truth labels and outlier scores, implement a computer program to compute the ROC curve for a set of data points.

16. Use the objective function criteria of various outlier detection algorithms to design corresponding internal validity measures. Discuss the bias in these measures towards favoring specific algorithms.

17. Suppose that you construct a directed k-nearest neighbor graph from a data set. How can you use the degrees of the nodes to obtain an outlier score? What characteristics does this algorithm share with LOF?

Chapter 9
Outlier Analysis: Advanced Concepts

“If everyone is thinking alike, then somebody isn’t thinking.”—George S. Patton

9.1 Introduction

Many scenarios for outlier analysis cannot be addressed with the use of the techniques discussed in the previous chapter. For example, the data type has a critical impact on the outlier detection algorithm. In order to use an outlier detection algorithm on categorical data, it may be necessary to change the distance function or the family of distributions used in expectation–maximization (EM) algorithms. In many cases, these changes are exactly analogous to those required in the context of the clustering problem. Other cases are more challenging. For example, when the data is very high dimensional, it is often difficult to apply outlier analysis because of the masking behavior of the noisy and irrelevant dimensions. In such cases, a new class of methods, referred to as subspace methods, needs to be used. In these methods, the outlier analysis is performed in lower dimensional projections of the data. In many cases, it is hard to discover these projections, and therefore results from multiple subspaces may need to be combined for better robustness. The combination of results from multiple models is more generally referred to as ensemble analysis. Ensemble analysis is also used for other data mining problems such as clustering and classification.
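As a companion to the earlier exercise on computing an ROC curve from ground-truth labels and outlier scores, a minimal Python sketch (the helper names roc_curve and auc are illustrative; tied scores are simply processed in sort order rather than grouped at a common threshold):

```python
def roc_curve(labels, scores):
    # Sweep the decision threshold from the highest outlier score downward,
    # recording (false positive rate, true positive rate) at each cut-off.
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoidal area under the ROC curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A perfect ranking (all outliers scored above all inliers) yields an area of 1.0, a reversed ranking 0.0, and a random ranking about 0.5, which matches the usual interpretation of the ROC area as a rank-correlation measure between scores and ground truth.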
In principle, ensemble analysis in outlier detection is analogous to that in data clustering or classification. However, in the case of outlier detection, ensemble analysis is especially challenging. This chapter will study the following three classes of challenging problems in outlier analysis: