Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	169/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 165 166 167 168 169 170 171 172 ... 423

1-Data Mining tarjima

Breadth-first approach: The ranking of the data points returned by the diﬀerent algo-rithms is used for combination purposes. The top-ranked outliers over all the diﬀerent algorithm executions are ranked first, followed by the second-ranked outliers (with repetitions removed), and so on. Minor variations could exist because of tie-breaking between the outliers within a particular rank. This is equivalent to using the best rank for each data point over all executions as the final outlier score.

Cumulative sum approach: The outlier scores over the diﬀerent algorithm executions are summed up. The top-ranked outliers are reported on this basis.

At first sight, it seems that the random subspace sampling approach does not attempt to optimize the discovery of subspaces to finding rare instances at all. However, the idea here is that it is often hard to discover the rare subspaces anyway, even with the use of heuristic optimization methods. The robustness resulting from multiple subspace sampling is clearly a very desirable quality of the approach. Such methods are common in high-dimensional analysis where multiple subspaces are sampled for greater robustness. This is related to the field of ensemble analysis, which will be discussed in the next section.

9.4 Outlier Ensembles

Several algorithms in outlier analysis, such as the high-dimensional methods discussed in the previous section, combine the scores from diﬀerent executions of outlier detection algo-rithms. These algorithms can be viewed as diﬀerent forms of ensemble analysis. Some exam-ples are enumerated below:

Parameter tuning in LOF: Parameter tuning in the LOF algorithm (cf. Sect. 8.5.2.1 of Chap. 8) can be viewed as a form of ensemble analysis. This is because the algorithm is executed over diﬀerent values of the neighborhood size k, and the highest LOF score over each data point is selected. As a result, a more robust ensemble model is constructed. In fact, many parameter-tuning algorithms in outlier analysis can be viewed as ensemble methods.

Random subspace sampling: The random subspace sampling method applies the approach to multiple random subspaces of the data to determine the outlier scores as a combination function of the original scores. Even the evolutionary high-dimensional outlier detection algorithm can be shown to be an ensemble with a maximization combination function.

Ensemble analysis is a popular approach in many data mining problems such as clustering, classification, and outlier detection. On the other hand, ensemble analysis is not quite well studied in the context of outlier analysis. Furthermore, ensemble analysis is particularly important in the context of outlier analysis because of the rare nature of outliers, and a corresponding possibility of overfitting. A typical outlier ensemble contains a number of diﬀerent components:

9.4. OUTLIER ENSEMBLES

275

Model components: These are the individual methodologies or algorithms that are integrated to create an ensemble. For example, a random subspace sampling method combines many LOF algorithms that are each applied to diﬀerent subspace projec-tions.

Normalization: Diﬀerent methods may create outlier scores on very diﬀerent scales. In some cases, the scores may be in ascending order. In others, the scores may be in descending order. In such cases, normalization is important for meaningfully combin-ing the scores, so that the scores from diﬀerent components are roughly comparable.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 165 166 167 168 169 170 171 172 ... 423