Breadth-first approach: The ranking of the data points returned by the different algo-rithms is used for combination purposes. The top-ranked outliers over all the different algorithm executions are ranked first, followed by the second-ranked outliers (with repetitions removed), and so on. Minor variations could exist because of tie-breaking between the outliers within a particular rank. This is equivalent to using the best rank for each data point over all executions as the final outlier score.
Cumulative sum approach: The outlier scores over the different algorithm executions are summed up. The top-ranked outliers are reported on this basis.
At first sight, it seems that the random subspace sampling approach does not attempt to optimize the discovery of subspaces to finding rare instances at all. However, the idea here is that it is often hard to discover the rare subspaces anyway, even with the use of heuristic optimization methods. The robustness resulting from multiple subspace sampling is clearly a very desirable quality of the approach. Such methods are common in high-dimensional analysis where multiple subspaces are sampled for greater robustness. This is related to the field of ensemble analysis, which will be discussed in the next section.
9.4 Outlier Ensembles
Several algorithms in outlier analysis, such as the high-dimensional methods discussed in the previous section, combine the scores from different executions of outlier detection algo-rithms. These algorithms can be viewed as different forms of ensemble analysis. Some exam-ples are enumerated below:
Parameter tuning in LOF: Parameter tuning in the LOF algorithm (cf. Sect. 8.5.2.1 of Chap. 8) can be viewed as a form of ensemble analysis. This is because the algorithm is executed over different values of the neighborhood size k, and the highest LOF score over each data point is selected. As a result, a more robust ensemble model is constructed. In fact, many parameter-tuning algorithms in outlier analysis can be viewed as ensemble methods.
Random subspace sampling: The random subspace sampling method applies the approach to multiple random subspaces of the data to determine the outlier scores as a combination function of the original scores. Even the evolutionary high-dimensional outlier detection algorithm can be shown to be an ensemble with a maximization combination function.
Ensemble analysis is a popular approach in many data mining problems such as clustering, classification, and outlier detection. On the other hand, ensemble analysis is not quite well studied in the context of outlier analysis. Furthermore, ensemble analysis is particularly important in the context of outlier analysis because of the rare nature of outliers, and a corresponding possibility of overfitting. A typical outlier ensemble contains a number of different components:
9.4. OUTLIER ENSEMBLES
|
275
|
Model components: These are the individual methodologies or algorithms that are integrated to create an ensemble. For example, a random subspace sampling method combines many LOF algorithms that are each applied to different subspace projec-tions.
Normalization: Different methods may create outlier scores on very different scales. In some cases, the scores may be in ascending order. In others, the scores may be in descending order. In such cases, normalization is important for meaningfully combin-ing the scores, so that the scores from different components are roughly comparable.
Dostları ilə paylaş: |