Algorithm IndependentEnsemble(Data Set: D, Base Algorithms: A1 . . . Ar)
begin
  j = 1;
  repeat
    Pick an algorithm Qj ∈ {A1 . . . Ar};
    Create a new data set fj(D) from D;
    Apply Qj to fj(D);
    j = j + 1;
  until(termination);
  return outliers based on combination of results from previous executions;
end
Figure 9.3: Independent ensemble framework
The broad principle of independent ensembles is that different ways of looking at the same problem provide more robust results that are not dependent on specific artifacts of a particular algorithm or data set. Independent ensembles are commonly used for parameter tuning of outlier detection algorithms. Another application is to explore outlier scores over multiple subspaces and then provide the best result.
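As a concrete illustration, the following is a minimal Python sketch of the independent ensemble template of Figure 9.3. Here, each component is assumed to be an exact k-nearest neighbor distance detector applied to a random subsample of the data, and the per-point scores are averaged across components; the function names, the choice of base detector, and the averaging step are illustrative choices rather than requirements of the framework.

import numpy as np

def knn_outlier_scores(X, ref, k=5):
    # Base detector: score each point in X by its distance to its
    # k-th nearest neighbor within the reference set ref.
    dists = np.linalg.norm(X[:, None, :] - ref[None, :, :], axis=2)
    dists.sort(axis=1)
    return dists[:, k - 1]

def independent_ensemble(X, n_components=10, sample_frac=0.5, k=5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    component_scores = []
    for _ in range(n_components):
        # f_j(D): derive a new data set by random subsampling.
        idx = rng.choice(n, size=int(sample_frac * n), replace=False)
        # Apply the base detector to the derived data set.
        component_scores.append(knn_outlier_scores(X, X[idx], k=k))
    # Combine the results of the independent executions (here, by averaging).
    return np.mean(component_scores, axis=0)

# Example: 100 Gaussian points plus one planted outlier at (6, 6).
X = np.vstack([np.random.default_rng(1).normal(size=(100, 2)), [[6.0, 6.0]]])
print(np.argmax(independent_ensemble(X)))   # likely index 100, the planted outlier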
9.4.2 Categorization by Constituent Components
A second way of categorizing ensemble analysis algorithms is on the basis of their constituent components. In general, these two ways of categorization are orthogonal to one another, and an ensemble algorithm may be any of the four combinations created by these two forms of categorization.
Consider the case of parameter tuning in LOF and the case of subspace sampling in the feature bagging method. In the first case, each model is an application of the LOF model with a different parameter choice. Therefore, each component can itself be viewed as an outlier analysis model. On the other hand, in the random subspace method, the same algorithm is applied to a different selection (projection) of the data. In principle, it is possible to create an ensemble with both types of components, though this is rarely done in practice. Therefore, the categorization by constituent components leads to either model-centered ensembles or data-centered ensembles.
9.4.2.1 Model-Centered Ensembles
Model-centered ensembles combine the outlier scores from different models built on the same data set. The example of parameter tuning for LOF can be considered a model-centered ensemble. Thus, it is an independent ensemble based on one form of categorization, and a model-centered ensemble based on the other.
One advantage of using LOF in an ensemble algorithm is that the scores are roughly comparable to one another. This may not be true for an arbitrary algorithm. For example, if the raw k-nearest neighbor distance were used, the parameter tuning ensemble would always favor larger values of k when using a combination function that picks the maximum
value of the outlier score. This is because the scores across different components would not be comparable to one another. Therefore, it is crucial to use normalization during the combination process. This is an issue that will be discussed in some detail in Sect. 9.4.3.
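For instance, a minimal sketch of such a model-centered ensemble, using the LocalOutlierFactor implementation in scikit-learn (an assumed dependency, not something prescribed by the text) with several values of k, might look as follows. Because LOF scores are roughly comparable across parameter settings, a componentwise maximum is a reasonable combination function here.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_tuning_ensemble(X, k_values=(5, 10, 20, 40)):
    component_scores = []
    for k in k_values:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        # negative_outlier_factor_ holds the negated LOF values; flip the
        # sign so that larger scores indicate greater outlierness.
        component_scores.append(-lof.negative_outlier_factor_)
    # LOF scores are roughly comparable across k, so take the maximum.
    return np.max(component_scores, axis=0)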
9.4.2.2 Data-Centered Ensembles
In data-centered ensembles, different parts, samples, or functions of the data are explored to perform the analysis. A function of the data could include either a sample of the data (horizontal sample) or a relevant subspace (vertical sample). The random subspace sampling approach of the previous section is an example of a data-centered ensemble. More general functions of the data are also possible, though they are rarely used. Each function of the data may provide different insights about a specific part of the data. This is the key to the success of the approach. It should be pointed out that a data-centered ensemble may also be considered a model-centered ensemble by incorporating a preprocessing phase that generates a specific function of the data as a part of the model.
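As an illustration, a data-centered ensemble in the style of random subspace (feature bagging) sampling can be sketched as follows. The convention of drawing between d/2 and d-1 attributes per component follows the feature bagging method; the base_scorer argument stands for any function that maps a data matrix to per-point outlier scores, and the averaging step is one of several possible combination choices.

import numpy as np

def feature_bagging_ensemble(X, base_scorer, n_components=10, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]   # assumes at least two attributes
    component_scores = []
    for _ in range(n_components):
        # Vertical sample: a random subspace of between d/2 and d-1 attributes.
        n_dims = rng.integers(d // 2, d)
        dims = rng.choice(d, size=n_dims, replace=False)
        # The same base detector is applied to each projection of the data.
        component_scores.append(base_scorer(X[:, dims]))
    # Average the per-component scores into a single outlier score.
    return np.mean(component_scores, axis=0)

For example, feature_bagging_ensemble(X, lambda Z: knn_outlier_scores(Z, Z, k=5)) reuses the k-nearest neighbor base detector sketched earlier on each sampled subspace.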
9.4.3 Normalization and Combination
The final stage of ensemble analysis is to put together the scores derived from the different models. The major challenge in model combination arises when the scores across different models are not comparable with one another. For example, in a model-centered ensemble, if the different components of the model are heterogeneous, the scores will not be comparable to one another. A k-nearest neighbor outlier score is not comparable to an LOF score. Therefore, normalization is important. In this context, univariate extreme value analysis is required to convert scores to normalized values. Two methods of varying levels of complexity are possible:
• The univariate extreme value analysis methods in Sect. 8.2.1 of Chap. 8 may be used. In this case, a Z-number may be computed for each data point. While such a model makes the normal distribution approximation, it still provides better scores than using raw values.
• If more refined scores are desired, and some insights are available about “typical” distributions of outlier scores, then the mixture model of Sect. 6.5 in Chap. 6 may be used to generate probabilistically interpretable fit values. The bibliographic notes provide a specific example of one such method.
Another problem is that the ordering of the outlier scores may vary with the outlier detection algorithm (ensemble component) at hand. In some algorithms, high scores indicate greater outlierness, whereas the reverse is true in other algorithms. In the former case, the Z-number of the outlier score is computed, whereas in the latter case, the negative of the Z-number is computed. These two values are on the same scale and more easily comparable.
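A minimal sketch of this normalization step is shown below; the small constant in the denominator guards against a zero standard deviation and is an implementation convenience rather than part of the description above.

import numpy as np

def normalize_scores(raw_scores, higher_is_abnormal=True):
    # Z-number of each score: (score - mean) / standard deviation.
    z = (raw_scores - np.mean(raw_scores)) / (np.std(raw_scores) + 1e-12)
    # If low raw scores indicate outliers, negate so that higher
    # normalized values always indicate greater abnormality.
    return z if higher_is_abnormal else -z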
The final step is to combine the scores from the different ensemble components. The method of combination may, in general, depend upon the composition of the ensemble. For example, in sequential ensembles, the final outlier score may be the score from the last execution of the ensemble. However, in general, the scores from the different components are combined with the use of a combination function. In the following, the convention assumed is that higher (normalized) scores are indicative of greater abnormality. Two combination functions are particularly common.
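Assuming, as is common in the ensemble literature, that these two functions are the componentwise maximum and the componentwise average of the normalized scores, a minimal sketch is as follows.

import numpy as np

def combine_scores(score_matrix, method="avg"):
    # score_matrix: one row of normalized scores per ensemble component,
    # one column per data point; higher values indicate greater abnormality.
    if method == "max":
        return np.max(score_matrix, axis=0)
    return np.mean(score_matrix, axis=0)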