Data Mining: The Textbook

9.5.5 Biological and Medical Applications

Most of the data produced in biological and medical applications are of complex types, which are studied in later chapters. Many diagnostic tools, such as sensors and medical imaging devices, produce one or more complex data types. Some examples are as follows:

Many diagnostic tools commonly used in emergency rooms, such as the electrocardiogram (ECG), produce temporal sensor data. Unusual shapes in these readings may be used to make predictions.

Medical imaging applications are able to store 2-dimensional and 3-dimensional spatial representations of various tissues. Examples include magnetic resonance imaging (MRI) and computerized axial tomography (CAT) scans. These representations may be utilized to determine anomalous conditions.

Genetic data are represented in the form of discrete sequences. Unusual mutations are indicative of specific diseases, the determination of which is useful for diagnostic and research purposes.

Most of the aforementioned applications involve complex data types, and are discussed in detail later in this book.

9.5.6 Earth Science Applications

Anomaly detection is useful in earth science applications for identifying unusual variations of environmental variables such as temperature and pressure. These variations can be used to detect unusual changes in the climate, or important events such as hurricanes. Another interesting application is that of determining land cover anomalies, where interesting changes in the forest cover patterns are determined with the use of outlier analysis methods. Such applications typically require the use of spatial outlier detection methods, which are discussed in Chap. 16.

9.6 Summary

Outlier detection methods can be generalized to categorical data using methodologies similar to those used for cluster analysis. Typically, this requires a change in the mixture model for probabilistic models, and a change in the distance function for distance-based models. High-dimensional outlier detection is a particularly difficult case because the large number of irrelevant attributes interferes with the outlier detection process. Therefore, subspace methods need to be designed. Many of the subspace exploration methods use insights from multiple views of the data to determine outliers. Most high-dimensional methods are ensemble methods. Ensemble methods can be applied beyond high-dimensional scenarios to applications such as parameter tuning. Outlier analysis has numerous applications to diverse domains, such as fault detection, financial fraud, Web log analytics, medical applications, and earth science. Many of these applications are based on complex data types, which are discussed in later chapters.

9.7 Bibliographic Notes

A mixture model algorithm for outlier detection in categorical data is proposed in [518]. This algorithm is also able to address mixed data types with the use of a joint mixture model between quantitative and categorical attributes. Any of the categorical data clustering methods discussed in Chap. 7 can be applied to outlier analysis as well. Popular clustering algorithms include k-modes [135, 278], ROCK [238], CACTUS [220], LIMBO [75], and STIRR [229]. Distance-based outlier detection methods require the redesign of the distance function. Distance functions for categorical data are discussed in [104, 182]. In particular, the work in [104] explores categorical distance functions in the context of the outlier detection problem. A detailed description of outlier detection algorithms for categorical data may be found in [5].
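
To make the idea of redesigning the distance function concrete, the following is a minimal sketch of a distance-based outlier score on categorical records. It assumes the simplest possible categorical distance, the fraction of mismatching attributes (often called the overlap measure), and scores each record by the distance to its k-th nearest neighbor. The function names and the toy records are illustrative only; the functions studied in [104, 182] are more refined.

```python
from typing import List, Sequence


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def knn_outlier_scores(records: List[Sequence[str]], k: int = 2) -> List[float]:
    """Score each record by the distance to its k-th nearest neighbor."""
    if not 1 <= k < len(records):
        raise ValueError("k must be between 1 and len(records) - 1")
    scores = []
    for i, record in enumerate(records):
        # Distances to all other records, smallest first.
        dists = sorted(
            overlap_distance(record, other)
            for j, other in enumerate(records)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


# Toy data (assumed for illustration): the last record shares almost
# no attribute values with the others.
records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
scores = knn_outlier_scores(records, k=2)
most_anomalous = scores.index(max(scores))
```

Swapping `overlap_distance` for a frequency-aware measure changes only one function, which is the sense in which distance-based methods carry over to categorical data.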

Subspace outlier detection addresses the effectiveness issue of outlier analysis, and was first proposed in [46]. In the context of high-dimensional data, there are two distinct lines of research: one investigates the efficiency of high-dimensional outlier detection [66, 501], and the other investigates the more fundamental issue of the effectiveness of high-dimensional outlier detection [46]. The masking behavior of the noisy and irrelevant dimensions was discussed by Aggarwal and Yu [46]. The efficiency-based methods often design more effective indexes, which are tuned toward determining nearest neighbors and pruning more efficiently for distance-based algorithms. The random subspace sampling method discussed in this book was proposed in [334]. An isolation-forest approach was proposed in [365]. A number of ranking methods for subspace outlier exploration have been proposed in [396, 397]. In these methods, outliers are determined in multiple subspaces of the data. Different subspaces may provide information either about different outliers or about the same outliers. Therefore, the goal is to combine the information from these different subspaces in a robust way to report the final set of outliers. The OUTRES algorithm proposed in [396] uses recursive subspace exploration to determine all the subspaces relevant to a particular data point. The outlier scores from these different subspaces are combined to provide a final value. A more recent method for using multiple views of the data for subspace outlier detection is proposed in [397].
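
The random subspace sampling idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the algorithm of [334]: the base detector is a plain k-nearest neighbor distance rather than LOF, each round samples between d/2 and d-1 dimensions (assumed here), and the per-round scores are standardized and averaged. All names and the synthetic data are hypothetical.

```python
import math
import random


def knn_scores(points, dims, k):
    """k-th nearest neighbor distance of each point, using only `dims`."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))
            for j, q in enumerate(points)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


def standardize(scores):
    """Convert scores to z-values so that rounds are comparable."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # guard against constant scores
    return [(s - mean) / std for s in scores]


def random_subspace_scores(points, rounds=20, k=3, seed=0):
    """Average standardized k-NN scores over randomly sampled subspaces."""
    rng = random.Random(seed)
    d = len(points[0])
    totals = [0.0] * len(points)
    for _ in range(rounds):
        # Assumed sampling rule: subspace size between d/2 and d-1.
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        dims = rng.sample(range(d), size)
        for i, z in enumerate(standardize(knn_scores(points, dims, k))):
            totals[i] += z
    return [t / rounds for t in totals]


# Synthetic data: 30 Gaussian points plus one point that deviates
# only in the first two of four dimensions.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(30)]
points.append([8.0, 8.0, 0.0, 0.0])
scores = random_subspace_scores(points)
top = scores.index(max(scores))
```

Because every round sees a different subset of dimensions, a point that is masked by irrelevant attributes in one subspace can still receive a high score in another; the averaging step is what combines these views.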

Recently, the problem of outlier detection has also been studied in the context of dynamic data and data streams. The SPOT approach, proposed in [546], is able to determine projected outliers from high-dimensional data streams. This approach employs a window-based time model and decaying cell summaries to capture statistics from the data stream. A set of top sparse subspaces is obtained by a variety of supervised and unsupervised learning processes. These are used to detect the projected outliers. A multiobjective genetic algorithm is employed for finding outlying subspaces from training data. The problem of high-dimensional outlier detection has also been extended to other application-specific scenarios such as astronomical data [265] and transaction data [264]. A detailed description of the high-dimensional case for outlier detection may be found in [5].
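
The notion of a decaying cell summary can be illustrated with a small sketch. This is not the SPOT implementation of [546]; it only assumes a fixed projection onto a few dimensions, an equi-width grid, and exponential decay with a chosen half-life. The class name, its parameters, and the flagging threshold are all hypothetical.

```python
import math


class DecayedCellCounts:
    """Exponentially decayed counts of grid cells in one fixed projection."""

    def __init__(self, dims, cell_width, half_life):
        self.dims = tuple(dims)
        self.cell_width = cell_width
        self.decay = math.log(2) / half_life  # decay rate per time unit
        self.cells = {}  # cell id -> (weight, time of last update)

    def _cell(self, point):
        return tuple(math.floor(point[d] / self.cell_width) for d in self.dims)

    def weight(self, point, now):
        """Decayed weight of the cell containing `point` at time `now`."""
        w, t = self.cells.get(self._cell(point), (0.0, now))
        return w * math.exp(-self.decay * (now - t))

    def update(self, point, now):
        """Add `point`; return the cell's decayed weight before the update."""
        before = self.weight(point, now)
        self.cells[self._cell(point)] = (before + 1.0, now)
        return before


# Toy stream (assumed): 100 points in one dense cell, then one point
# that falls into a cell never seen before.
summary = DecayedCellCounts(dims=(0, 1), cell_width=1.0, half_life=50.0)
stream = [(0.1 * (t % 5), 0.1 * (t % 3), float(t)) for t in range(100)]
stream.append((7.5, 7.5, 100.0))

flags = []
for now, point in enumerate(stream):
    sparse = summary.update(point, now) < 1.0  # hypothetical threshold
    flags.append(sparse and now >= 20)         # skip a warm-up period
```

Storing only a weight and a timestamp per cell lets old points fade without keeping them in memory, which is what makes such summaries attractive for streams.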

The problem of outlier ensembles is generally less well developed in the context of outlier analysis than in the context of problems such as clustering and classification. Many outlier ensemble methods, such as the LOF method [109], do not explicitly state the ensemble component in their algorithms. The issue of score normalization has been studied in [223], and can be used for combining ensembles. A recent position paper has formalized the concept of outlier ensembles, and defined different categories of outlier ensembles [24]. Because outlier detection problems are evaluated in a similar way to classification problems, most classification ensemble algorithms, such as different variants of bagging/subsampling, will also improve outlier detection at least from a benchmarking perspective. While the results do reflect an improved quality of outliers in many cases, they should be interpreted with caution. Many recent subspace outlier detection methods [46, 396, 397] can also be considered ensemble methods. The first algorithm on high-dimensional outlier detection [46] may also be considered an ensemble method. A detailed description of different applications of outlier analysis may be found in the last chapter of [5].
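
Score normalization matters because different detectors report scores on different scales. The sketch below shows one simple possibility, assumed here for illustration and not necessarily the scheme of [223]: each detector's scores are converted to z-values, and the normalized scores are combined by averaging or by taking the maximum. The function names and the example scores are hypothetical.

```python
import statistics


def zscores(scores):
    """Standardize one detector's scores to zero mean and unit variance."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return [0.0] * len(scores)
    return [(s - mean) / std for s in scores]


def combine(score_lists, how="average"):
    """Combine several detectors' scores after normalization."""
    normalized = [zscores(s) for s in score_lists]
    per_point = list(zip(*normalized))
    if how == "average":
        return [statistics.fmean(values) for values in per_point]
    if how == "maximum":
        return [max(values) for values in per_point]
    raise ValueError("how must be 'average' or 'maximum'")


# Two hypothetical detectors scoring the same four points on very
# different scales; both consider the last point the most unusual.
detector_a = [0.10, 0.20, 0.15, 0.90]
detector_b = [10.0, 12.0, 11.0, 95.0]

averaged = combine([detector_a, detector_b], how="average")
maximum = combine([detector_a, detector_b], how="maximum")
```

Averaging tends to dampen the effect of a single poor detector, whereas the maximum emphasizes points that any one detector finds very unusual; which is preferable depends on the application.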
